ML From Scratch: K-Nearest Neighbors Classifier

When it comes to solving classification problems via machine learning, there’s a wide variety of algorithm choices available for almost any data type or niche problem that one might be dealing with. These algorithmic choices can be broadly categorized into two groups, which are as follows.

  1. Parametric Algorithms: The algorithms belonging to this category rely on algebraic mathematical equations that use a set of weights and biases (collectively known as parameters) in order to predict the final discrete outcome for a given set of data. The model size, i.e., the total number of trainable parameters in the model, can vary from just a few (as in traditional machine learning algorithms) to millions or even billions (as generally seen in artificial neural nets).

  2. Non-Parametric Algorithms: The classification algorithms in this category are rather unique in the sense that they don’t use any trainable parameters at all. This means that, unlike their parametric counterparts, the models in this category don’t rely on a set of weights and biases to generate predictions, nor do they make any assumptions regarding the data. Rather, in order to predict, say, the target class of a new data point, these models use some sort of comparison technique that helps them determine a final outcome.

Today, in this article, we are going to study one such non-parametric classification algorithm in detail: the K-Nearest Neighbors (KNN) algorithm.

This is going to be a project-based guide. In the first part, we will cover the basics of the KNN algorithm. This will be followed by a project where we will implement a KNN model from scratch using basic PyData libraries like NumPy and Pandas, while understanding the mathematical foundations of the algorithm.

So buckle up, and let’s get started!

KNN Classifier Basics

[Image: KNN Classification (Image by author)]

To begin with, the KNN algorithm is one of the classic supervised machine learning algorithms that is capable of both binary and multi-class classification. Non-parametric by nature, KNN can also be used as a regression algorithm. However, for the scope of this article, we will only focus on the classification aspect of KNN.

KNN classification at a glance:

→ Supervised algorithm

→ Non-parametric

→ Used for both regression and classification

→ Supports both binary and multi-class classification

Before we move any further, let us first break down the definition and understand a few of the terms that we came across.

  • KNN is a “supervised” algorithm: In layman's terms, this means that the data used for training a KNN model is labeled.

  • KNN is used for both “binary” and “multi-class” classification: In machine learning terminology, a classification problem is one where, given a list of discrete values as possible prediction outcomes (known as target classes), the aim of the model is to determine which target class a given data point belongs to. For binary classification problems, the number of possible target classes is 2. A multi-class classification problem, as the name suggests, has more than 2 possible target classes. A KNN classifier can be used to solve either kind of classification problem.

With that done, we have a rough idea of what KNN is. But now, a very important question arises.

How does a KNN Classifier work?

As we read earlier, KNN is a non-parametric algorithm. Therefore, training a KNN classifier doesn’t require going through the more traditional approach of iterating over the training data for multiple epochs in order to optimize a set of parameters.

Rather, the actual training process in the case of KNN is quite the opposite. Training a KNN model simply involves fitting (or saving) all the training data instances into computer memory at once, which technically requires only a single training cycle.

After this is done, during the inference stage, where the model has to predict the target class for a completely new data point, the model simply compares this new data with the existing training data instances. Finally, on the basis of this comparison, the model assigns the new data point to its target class.

But now another question arises. What exactly is this comparison that we are talking about, and how does it occur? Well, quite honestly, the answer to this question is hidden in the name of the algorithm itself: K-Nearest Neighbors.

To understand this better, let us dive deeper into how the inference process works.

  • As the first step, our KNN model calculates the distance of this new data point from every single data point within the ‘fitted’ training data.

  • Then, in the next step, the algorithm selects the ‘k’ training data points that are closest to this new data point in terms of the calculated distance.

  • Finally, the algorithm compares the target labels of these ‘k’ points, which are the nearest neighbors of our new data point. The target label with the highest frequency among these k neighbors is assigned as the target class of the new data point.

And that’s how the KNN classification algorithm works.

Regarding the calculation of the distances between the data points, we will be using the Euclidean distance formula. We will dig into this distance calculation in the next section, where we will code our own KNN-based machine learning model from scratch.

So now onto the fun, practical part! We will begin by having a quick glance at the problem statement that we are addressing via our project.

Understanding the Problem Statement

For this project, we will be working on the famous UCI Red Wine Dataset. The aim of this project is to create a machine learning solution that can predict the quality of a red wine sample.

This is a multi-class classification problem. The target variable, i.e., the ‘quality’ of the wine, takes a discrete integer value ranging from 0 to 10, where a quality score of 10 denotes a wine of the highest quality standards.

Now that we have understood the problem, let us begin with the project by importing all the necessary project dependencies, which includes the necessary PyData modules and the dataset.

Importing Project Dependencies

In the first step, let us import all the necessary Python modules.
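A minimal sketch of these imports follows; the exact set used in the original notebook isn't shown here, so matplotlib is included on the assumption that we'll want it for the EDA plots later on.

```python
import numpy as np                 # numerical operations on arrays
import pandas as pd                # tabular data handling
import matplotlib.pyplot as plt    # plots used in the EDA section
```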

Now, let’s import our dataset.
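A sketch of the loading step, assuming the UCI red wine file has been downloaded locally as winequality-red.csv (the file name is an assumption; the UCI distribution of this dataset is semicolon-separated):

```python
# Load the red wine dataset; the UCI CSV uses ';' as the separator.
df = pd.read_csv('winequality-red.csv', sep=';')

# Peek at the first few rows.
print(df.head())
```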

[Image: A quick glance at the dataset (image by author)]

Now that we have imported the dataset, let us try to understand what each of the columns in our data denotes.

Understanding the Data

The following is a brief description of all the individual columns within our dataset.

[Image: Description of the dataset (image by author)]

As we discussed earlier, the ‘quality’ column is the target variable for this project. The rest of the columns in the dataset represent the feature variables that will be used for training the model.

Now that we know what the different columns in our dataset represent, let us move on to the next section, where we will do some pre-processing and exploration of our data.

Data Wrangling and EDA

Data wrangling (or preprocessing) involves analyzing the data to see if it needs any sort of cleaning or scaling so that it can be prepared for training the model.

As the first step of data preprocessing, we will check if there are any null values within our data that need to be dealt with.
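A one-liner along these lines does the check (a sketch; the original code isn't shown):

```python
# Count the missing (null) values in each column.
print(df.isnull().sum())
```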

[Image: Column description and null value count (image by author)]

As we can see, there are no null values within our dataset. This is a good thing since we won’t have to deal with any missing data. Now, let us have a look at the statistical analysis of the data.
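Pandas' describe() is the usual way to get this summary; a sketch:

```python
# Per-column summary statistics: count, mean, std, min, quartiles, max.
print(df.describe())
```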

One prominent observation from the above statistical analysis is that there’s a visible inconsistency in the range of values across different columns within our dataset. To be more clear, the values in some columns are of the order 1e-1, while in a few others, the values can go as high as the order 1e+2. Because of this inconsistency, there is a chance that a feature weight bias might arise at the time of training the model. What this basically means is that some features might end up affecting the final prediction more than others. Therefore, in order to prevent this weight imbalance, we will have to scale our data.

For this scaling, we will be standardizing our data. Standardization typically means rescaling data in a way such that each feature column has a mean of 0 and a standard deviation of 1 (unit variance).

The following is the mathematical formula for standard scaling.

z = (x - μ) / σ, where μ is the mean and σ is the standard deviation of the feature column.

Now that we know the mathematical formula, let us go ahead and implement it from scratch in Python.

Step-1: Separating the feature matrix and the target array.

Step-2: Declaring the standardization function.

Step-3: Performing standardization on the feature set.
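The three steps above might look as follows (a sketch; the column name 'quality' matches the dataset, but the function and variable names here are my own):

```python
# Step-1: separate the feature matrix X from the target array y.
X = df.drop('quality', axis=1).values
y = df['quality'].values

# Step-2: standard scaling; subtract each column's mean and divide
# by its standard deviation, giving mean 0 and unit variance.
def standardize(features):
    return (features - features.mean(axis=0)) / features.std(axis=0)

# Step-3: perform standardization on the feature set.
X = standardize(X)
```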

[Image: Standardized data (image by author)]

With this, we are done standardizing our data. This will most probably take care of the feature weight bias.

Now, for the last part of our data wrangling and EDA section, we will have a look at the distribution of values across the target column of our dataset.
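A quick way to plot this distribution (a sketch using pandas' built-in plotting):

```python
# Bar plot of the number of samples per quality score.
df['quality'].value_counts().sort_index().plot(kind='bar')
plt.xlabel('quality score')
plt.ylabel('number of samples')
plt.show()
```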

[Image: Target label counts (Image by author)]

Some of the observations from the above graph are:

  • Most wine samples in our data are rated 5 and 6, followed by 7.

  • No wine sample is rated above 8 or below 3. This implies that a wine sample of extremely high quality (9 or 10) or very low quality (0, 1 or 2) can either be thought of as a hypothetically ideal situation, or that the data is suffering from sampling bias, where samples of extreme quality didn’t get any representation within the survey.

  • Our assumption of a sampling bias within the data is further strengthened as we notice that a majority of wine samples are rated 5 or 6.

  • A model trained on such data will produce biased results, where it is more likely to classify a wine sample as a 5 or 6 on the quality scale as compared to, say, a 3 or an 8. To know more about sampling bias or how to deal with it, check out this article of mine.

Now that we are done exploring our data, let us move on to the final part of our project, where we will code our multi-class KNN classifier using NumPy.

Modeling and Evaluation

As the first step of our modeling process, we will first split our dataset into training and test sets. This is done because training and evaluating your model on the same data is considered a bad practice.

Let’s see how to implement the code to split the dataset using Python.

Step-1: Declaring the split function.

Step-2: Running the splitting function on our standardized dataset.
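A from-scratch split might look like this (a sketch; the 80/20 ratio and the random seed are assumptions):

```python
# Step-1: a simple shuffled train/test split.
def train_test_split(X, y, test_size=0.2, seed=42):
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))       # shuffled row indices
    split = int(len(X) * (1 - test_size))   # boundary between sets
    train_idx, test_idx = indices[:split], indices[split:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Step-2: run the split on the standardized data.
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```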

[Image: Shape of the splits (Image by author)]

Now that we have created our training and validation sets, we will finally see how to implement the model.

On the basis of what we have learned so far, here are the steps involved in creating a KNN model.

Step-1: Training the model. As we read earlier, in the case of KNN, this simply means saving the training dataset within memory. We have already done that when we created our training and validation data splits.

Step-2: Calculating the distance. As part of the inference process in the KNN algorithm, calculating the distance is an iterative process where we calculate the Euclidean distance of each data point (basically, a data instance/row) in the test data from every single data point within the training data.

[Image: Calculation of distance (Image by author)]

Now, let us understand how the Euclidean distance formula works so that we can implement it for our model.

  • Let us consider two data points A and B.

→ A = [a0, a1, a2, a3, …, an], where ai is a feature value of data point A.

Similarly, B = [b0, b1, b2, b3, …, bn].

Therefore, the Euclidean distance between the data points A and B is calculated using the following formula:

d(A, B) = √((a0 - b0)² + (a1 - b1)² + … + (an - bn)²)

Let’s now implement this distance calculation step in Python.

  • Step-2.1: Declaring a Python function to calculate the Euclidean distance between 2 points.

  • Step-2.2: Declaring a Python function to calculate the distance of a test point from each point in the training data. Both are sketched below.
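A sketch of both functions (the names euclidean_distance and distances_from_all are my own, not necessarily those in the original code):

```python
# Step-2.1: Euclidean distance between two points (1-D arrays).
def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Step-2.2: distances of one test point from every training point.
def distances_from_all(test_point, X_train):
    return np.array([euclidean_distance(test_point, train_point)
                     for train_point in X_train])
```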

Before we move on to the next step, let us test our distance function.

[Image: distances of the first test point from all training data points]

As we can see, our distance function successfully calculated the distance of the first point in our test data from all the training data points. Now we can move on to the next step.

Step-3: Selecting the k-nearest neighbors and making the prediction. This is the final step of the inference stage, where the algorithm selects the k training data points closest to the test data point, based on the distances calculated in step 2. Then, we consider the target labels of these k nearest neighboring points. The label with the highest frequency of occurrence is assigned as the target class to the test data point.

Let us see how to implement this in Python.

Step-3.1: Defining the KNN classification function.

Step-3.2: Running inference on our test dataset.
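A sketch of both steps; knn_predict is a hypothetical name, and k=5 here is an arbitrary starting value, not necessarily the one used in the original article:

```python
from collections import Counter

# Step-3.1: predict one test point's class by majority vote among
# the labels of its k nearest training points.
def knn_predict(test_point, X_train, y_train, k):
    distances = distances_from_all(test_point, X_train)
    nearest_idx = np.argsort(distances)[:k]   # indices of k closest
    nearest_labels = y_train[nearest_idx]
    return Counter(nearest_labels).most_common(1)[0][0]

# Step-3.2: run inference on the whole test set.
k = 5
y_pred = np.array([knn_predict(p, X_train, y_train, k) for p in X_test])
print(y_pred)
```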

[Image: Array of predicted values (Image by author)]

With this, we have completed the modeling and inference process. As a final step, we will evaluate our model’s performance. For this, we will be using a simple accuracy function that calculates the proportion of correct predictions made by our model.

Let’s have a look at how to implement the accuracy function in Python.

Step-1: Defining the accuracy function.

Step-2: Checking the accuracy of our model.
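A sketch of the accuracy check, reusing the y_pred array from the previous step:

```python
# Step-1: accuracy = fraction of predictions that match the labels.
def accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)

# Step-2: check the accuracy of our scratch model.
print(f'Scratch KNN accuracy: {accuracy(y_test, y_pred):.4f}')
```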

[Image: Initial model accuracy]

Step-3: Comparing with the accuracy of a KNN classifier built using the Scikit-Learn library.
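A sketch of the comparison, fitting Scikit-Learn's KNeighborsClassifier on the same splits with the same k-value:

```python
from sklearn.neighbors import KNeighborsClassifier

# Step-3: train a Scikit-Learn KNN model with the same k and
# evaluate it on the same test split.
sk_model = KNeighborsClassifier(n_neighbors=k)
sk_model.fit(X_train, y_train)
print(f'Scikit-Learn KNN accuracy: {sk_model.score(X_test, y_test):.4f}')
```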

[Image: Scikit-Learn accuracy with the same k-value as the scratch model]

An interesting observation here! Though our model didn’t perform very well (with only 57% of predictions correct), it has the exact same accuracy as a Scikit-Learn KNN model. This means the model that we defined from scratch was at least able to replicate the performance of a pre-defined model, which is an achievement in itself!

However, I believe we can further improve the model’s performance to some extent. Therefore, as the last part of our project, we will find the value of the hyperparameter ‘k’ for which our model gives the highest accuracy.

Model Optimization

Before we actually go on to finding the best k-value, let us first understand the importance of the k-value in the K-Nearest Neighbors algorithm.

  • The k-value in the KNN algorithm determines the number of training data points that are to be considered while determining the class of a test data point.

  • Impact of a low k-value: If the k-value is very low, say, 1 or 2, the model becomes very sensitive to outliers in the data. Outliers can be defined as the extreme instances within the data that do not follow its general trends. Because of this, the predictions of the model become very unstable.

  • Impact of a high k-value: Now, as the k-value in the KNN algorithm increases, a weird trend is observed. At first, an increase is observed in the stability of the algorithm. One reason for this can be that as we consider more neighbors for predicting the target class of a test data point, the effect of outliers decreases because of majority voting. However, as we continue to increase the k-value, after a certain point we start to observe a decline in the stability of the algorithm, and the model accuracy starts to deteriorate.

The following graph roughly represents the relation between the k-value and stability of a KNN classifier model.

[Image: k-value vs stability (Image by author)]

Now, let us finally evaluate the model for a range of different k-values. The one with the highest accuracy will be chosen as the final k-value for our model.
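A sketch of the sweep (the search range of 1 to 20 is an assumption):

```python
# Evaluate the scratch model for a range of k-values and record
# the test accuracy at each one.
k_values = range(1, 21)
scores = []
for k_val in k_values:
    preds = np.array([knn_predict(p, X_train, y_train, k_val)
                      for p in X_test])
    scores.append(accuracy(y_test, preds))

# Plot accuracy against k to pick the best value.
plt.plot(k_values, scores, marker='o')
plt.xlabel('k-value')
plt.ylabel('accuracy')
plt.show()
```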

[Image: Model accuracy with different k-values (Image by author)]

As we can see, k=1 has the highest accuracy. But as we discussed earlier, for k=1 the model will be very sensitive to outliers. Hence, we will go with k=8, which has the second-highest accuracy. Let us observe the results with k=8.

[Image: model accuracy with k=8]

As you can see, we got a performance boost here! Just by tweaking the hyperparameter ‘k’, our model’s accuracy bumped up by almost 3 percent.

With this, we come to the end of our project.

To finish things off, let us have a quick rundown of all that we learned today, as well as some of the key takeaways from this lesson.

Conclusions

In this article, we had an in-depth analysis of the K-Nearest Neighbors classification algorithm. We understood how the algorithm uses the Euclidean distance between data instances as a criterion of comparison, on the basis of which it predicts the target class for a particular data instance.

In the second part of this guide, we went through the step-by-step process of creating a KNN classification model from scratch, primarily using Python and NumPy.

Though our model was not able to give a stellar performance, at least we were able to match the performance of a predefined Scikit-Learn model. And while we were able to increase the model’s accuracy up to 60% via hyperparameter optimization, the performance was still not great.

This has a lot to do with how the data was structured. As we observed earlier, there was a huge sampling bias in the data, which certainly affected our model’s performance. Another reason for the poor performance could be that the data had a large number of outliers. All this brings us to a very important part that I left for the very end, where we will have a look at the advantages and disadvantages of the KNN classification algorithm.

Advantages of KNN

  • As we discussed earlier, being a non-parametric algorithm, KNN doesn’t require multiple training cycles in order to adapt to the trends within the training data. As a result, KNN has an almost negligible training time and is, in fact, one of the fastest machine learning algorithms to train.

  • The implementation of KNN is very easy, as compared to some other, more complex classification algorithms.

Disadvantages of KNN

  • When it comes to inference, KNN is very compute-intensive. For inference on each test data instance, the algorithm has to calculate its distance from every single point in the training data. In terms of time complexity, for n training data instances and m test data instances, the algorithm evaluates to O(m × n).

  • As the dimensionality (i.e., the total number of features) and the scale of the dataset (i.e., the total number of data instances) increase, the model size also increases, which in turn impacts the performance and speed of the model. Therefore, KNN is not a good algorithm choice for high-dimensional, large-scale datasets.

  • The KNN algorithm is very sensitive to outliers in the data. Even a slight increase in the noise within the dataset might drastically affect the model’s performance.

With this, we finally come to an end of today’s learning session. In another article of mine, I have directly pitted a bunch of machine learning algorithms against each other. There, you can check out how KNN fares against other classification algorithms like logistic regression, decision tree classifiers, random forest ensembles, etc.

By the way, this was the fourth article in my ML from Scratch series, where I cover different machine learning algorithms and their mathematical foundations in detail. If you are interested in learning more, the other articles in this series are-

Link to the project GitHub files.

If you liked the article and would love to keep seeing more articles in the ML from Scratch series, make sure you hit that follow button.

Happy learning!

Translated from: https://towardsdatascience.com/ml-from-scratch-k-nearest-neighbors-classifier-3fc51438346b