Image classification with Fashion MNIST: why convolutional neural networks outperform traditional methods

In the last decade, with the rise of deep learning, the field of image classification has experienced a renaissance. Traditional machine learning methods have been replaced by newer and more powerful deep learning algorithms, such as the convolutional neural network. However, to truly understand and appreciate deep learning, we must know why it succeeds where the other methods fail. In this article, we try to answer some of those questions by applying various classification algorithms to the Fashion MNIST dataset.

Dataset information

Fashion MNIST was introduced in August 2017 by the research lab at Zalando Fashion. Its goal is to serve as a new benchmark for testing machine learning algorithms, as MNIST had become too easy and overused. While MNIST consists of handwritten digits, Fashion MNIST is made of images of 10 different clothing objects. Each image has the following properties:

  • Its size is 28 × 28 pixels.
  • It is rotated to a standard orientation and represented in grayscale, with integer values ranging from 0 to 255.
  • Blank space is represented by black, with value 0.

In the dataset, we distinguish between the following clothing objects:

  • T-shirt/Top
  • Trousers
  • Pullover
  • Dress
  • Coat
  • Sandal
  • Shirt
  • Sneaker
  • Bag
  • Ankle Boot

Exploratory data analysis

As the dataset is available as part of the Keras library, and the images are already processed, there is no need for much preprocessing on our part. The only change we made was converting each image from a 2D array into a 1D array, as that makes the images easier to work with.
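The flattening step can be sketched as follows. This is a minimal illustration rather than the author's code: it uses a small synthetic array as a stand-in for the images, which in practice would come from Keras (for example via `keras.datasets.fashion_mnist.load_data()`). The variable names are assumptions.

```python
import numpy as np

# Synthetic stand-in for the image array: 5 grayscale images of 28 x 28 pixels
# with integer intensities in 0..255 (the real data would be loaded from Keras).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28), dtype=np.uint8)

# Flatten each 2D image into a 1D vector of 28 * 28 = 784 values.
flat = images.reshape(len(images), -1)

print(flat.shape)  # (5, 784)
```

Reshaping back with `flat[i].reshape(28, 28)` recovers the original image, so no information is lost by this step.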

The dataset consists of 70,000 images, of which 60,000 form the training set and 10,000 the test set. As in the original MNIST dataset, the classes are evenly distributed (6,000 images of each class in the training set and 1,000 in the test set).

[Figure: Examples of images of different items of clothing. Photo by the author.]

However, a single image still has 784 dimensions, so we turned to principal component analysis (PCA) to see how much of the variation can be captured in fewer dimensions. We set the conventional threshold of 80% of the cumulative variance, and the plot showed that this is reached with only around 25 principal components (about 3% of the total number of PCs). That is not surprising: as the examples above show, every image contains a lot of shared unused space, and different classes of clothing leave different parts of the image black. The latter is likely connected to the fact that around 70% of the cumulative variance is explained by only 8 principal components.
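The cumulative-variance calculation behind such a plot can be sketched with NumPy alone. This is an illustrative sketch, not the article's code: the function name `components_for_variance` is hypothetical, and the data is a synthetic low-rank matrix standing in for the flattened images, so the counts it produces say nothing about Fashion MNIST itself.

```python
import numpy as np

def components_for_variance(X, threshold=0.80):
    """Return the number of principal components needed to reach `threshold`
    of the cumulative explained variance, plus the cumulative curve."""
    Xc = X - X.mean(axis=0)                    # center every feature
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    explained = s ** 2 / np.sum(s ** 2)        # variance ratio per component
    cumulative = np.cumsum(explained)
    # First index where the curve reaches the threshold, as a 1-based count.
    k = int(np.searchsorted(cumulative, threshold) + 1)
    return k, cumulative

# Synthetic stand-in: 200 samples, 50 features, driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

k, cumulative = components_for_variance(X, threshold=0.80)
print(k)
```

On the real data, the same function applied to the 784-column matrix of flattened images would give the component count to compare against the roughly 25 reported above.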

[Figure: Cumulative percent of variance explained. Photo by the author.]

We will use the principal components as input to logistic regression, random forest and support vector machines.

Image classification problems represent just a small subset of classification problems. The most used image classification methods are deep learning algorithms, one of which is the convolutional neural network. The rest of the employed methods are a small collection of common classification methods. As class labels are evenly distributed, with no misclassification penalties, we evaluate the algorithms using the accuracy metric.

Convolutional Neural Network (CNN)

The first method we employed was a CNN. As the images are in grayscale, we used only one channel. We selected the following architecture:

  • Two convolutional layers with 32 and 64 filters, 3 × 3 kernel size, and relu activation.
  • The pooling layers were chosen to operate on tiles of size 2 × 2 and to select the maximal element in them.
  • Two dense layers, the first with 128 units and relu activation, the second with softmax activation.
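To get a feel for the size of this architecture, the layer shapes and parameter counts can be worked out by hand. The sketch below makes several assumptions that the text does not state: unpadded ("valid") convolutions with stride 1, a pooling layer after each convolutional layer, and 10 output units in the final dense layer (one per class). The helper names are hypothetical.

```python
def conv_out(size, kernel=3):
    """Spatial size after an unpadded ('valid') convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, window=2):
    """Spatial size after non-overlapping max pooling."""
    return size // window

def conv_params(in_channels, out_channels, kernel=3):
    """Weights plus one bias per filter."""
    return kernel * kernel * in_channels * out_channels + out_channels

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

size = 28
size = pool_out(conv_out(size))   # conv (32 filters) then pool: 28 -> 26 -> 13
size = pool_out(conv_out(size))   # conv (64 filters) then pool: 13 -> 11 -> 5
flat = size * size * 64           # flattened feature vector: 5 * 5 * 64 = 1600

params = {
    "conv1": conv_params(1, 32),         # one grayscale input channel
    "conv2": conv_params(32, 64),
    "dense1": dense_params(flat, 128),
    "dense2": dense_params(128, 10),     # assumed: 10 classes
}
total = sum(params.values())
print(params, total)
```

Under these assumptions most of the parameters sit in the first dense layer, not in the convolutional layers.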

There is nothing special about this architecture. In fact, it is one of the simplest architectures we can use for a CNN. That shows us the true power of this class of methods: getting great results with a baseline structure.

For the loss function, we chose categorical cross-entropy. To avoid overfitting, we set aside 9,400 images from the training set to serve as a validation set for our parameters. We used the Adam optimizer, which improves over standard gradient descent methods by using a different learning rate for each parameter, with a batch size of 64. The model was trained for 50 epochs. We present the accuracy and loss values in the graphs below.
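What "a different learning rate for each parameter" means can be seen in a single Adam update. The sketch below is a plain-Python illustration of the standard update rule, not the Keras implementation; the function name `adam_step` is hypothetical and the default hyperparameters are the commonly cited ones, not necessarily those used in the article.

```python
import math

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a list of scalar parameters.

    m and v hold the running averages of the gradients and squared gradients;
    t is the 1-based step counter used for bias correction.
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1 - beta1) * g        # running mean of gradients
        v_i = beta2 * v_i + (1 - beta2) * g * g    # running mean of squared gradients
        m_hat = m_i / (1 - beta1 ** t)             # bias-corrected estimates
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Two parameters whose gradients differ by a factor of 20.
params, m, v = [1.0, -2.0], [0.0, 0.0], [0.0, 0.0]
grads = [2.0, -40.0]
params, m, v = adam_step(params, grads, m, v, t=1)
print(params)
```

Both parameters move by roughly `lr` on the first step even though their gradients differ by a factor of 20, because each step is scaled by that parameter's own gradient history. Plain gradient descent would move the second parameter 20 times further.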

[Figures: accuracy and loss values during training. Photos by the author.]

We see that the algorithm converged after 15 epochs and that it is not overtrained, so we tested it. The obtained testing accuracy was 89%, which is the best result obtained out of all methods!

Before proceeding to other methods, let's explain what the convolutional layers have done. An intuitive explanation is that the first layer captures straight lines and the second one curves. After both layers we applied max pooling, which selects the maximal value in each pooling window, separating clothing parts from blank space. In that way, we capture the representative nature of the data. In other words, neural networks perform feature selection by themselves. After the last pooling layer, the output is passed to an ordinary artificial neural network. Because we are dealing with a classification problem, the final layer uses softmax activation to get class probabilities. As the class probabilities form a distribution, cross-entropy measures how far the network's predicted distribution is from the true one.
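The three operations mentioned here (max pooling, softmax, and cross-entropy) are small enough to write out directly. The sketch below is a plain-Python illustration on made-up numbers; the function names and the example values are assumptions, and a real network would apply these operations to whole batches of feature maps.

```python
import math

def max_pool_2x2(grid):
    """Max pooling over non-overlapping 2 x 2 tiles of a 2D list
    (both dimensions are assumed to be even)."""
    return [
        [max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
         for c in range(0, len(grid[0]), 2)]
        for r in range(0, len(grid), 2)
    ]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_class):
    """Categorical cross-entropy against a one-hot target."""
    return -math.log(probs[true_class])

# A 4 x 4 patch: mostly blank (0) with a bright region in one corner.
patch = [[0, 0, 12, 40],
         [0, 7, 90, 255],
         [0, 0, 0, 3],
         [0, 0, 1, 0]]
pooled = max_pool_2x2(patch)
print(pooled)  # [[7, 255], [0, 3]]

probs = softmax([2.0, 1.0, 0.1])
print(cross_entropy(probs, 0), cross_entropy(probs, 2))
```

The pooled patch keeps the strongest response from each tile, and the loss is small when the true class receives a high probability and large when it receives a low one.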

Multinomial Logistic Regression

As the class labels are categorical, we can apply multinomial logistic regression. We apply it in a one-vs-rest fashion, training ten binary logistic regression classifiers that we then use to select the class. In order not to overtrain, we used L2 regularization. We get 80% accuracy with this algorithm, 9 percentage points below the convolutional neural network. But we have to take into account that this algorithm worked on grayscale images which are centred and upright, with lots of blank space, so it may not work for more complex images.
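The one-vs-rest scheme with L2 regularization can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the article's code: it trains with plain gradient descent, the function names and hyperparameters (`l2`, `lr`, `steps`) are hypothetical, and the data is three synthetic 2D clusters, not the ten clothing classes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_one_vs_rest(X, y, n_classes, l2=0.01, lr=0.1, steps=500):
    """Train one L2-regularized binary logistic classifier per class."""
    n, d = X.shape
    W = np.zeros((n_classes, d))
    b = np.zeros(n_classes)
    for k in range(n_classes):
        target = (y == k).astype(float)      # class k versus the rest
        for _ in range(steps):
            p = sigmoid(X @ W[k] + b[k])
            # Gradient of the mean log-loss plus the L2 penalty on the weights.
            W[k] -= lr * (X.T @ (p - target) / n + l2 * W[k])
            b[k] -= lr * np.mean(p - target)
    return W, b

def predict(X, W, b):
    """Pick the class whose binary classifier gives the highest score."""
    return np.argmax(X @ W.T + b, axis=1)

# Synthetic stand-in: three well-separated 2D clusters, 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 5.0], [5.0, -3.0], [-5.0, -3.0]])
X = np.vstack([c + 0.5 * rng.normal(size=(50, 2)) for c in centers])
y = np.repeat(np.arange(3), 50)

W, b = fit_one_vs_rest(X, y, n_classes=3)
accuracy = np.mean(predict(X, W, b) == y)
print(accuracy)
```

With ten classes the loop would train ten such classifiers, one per clothing category, and the prediction step would stay the same.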

Nearest neighbors and centroid algorithms

We used two different nearest distance algorithms:

  • K-nearest neighbors
  • Nearest Centroid

The nearest centroid algorithm computes the mean of the elements of each class and assigns a test element to the class whose centroid is nearest. Both algorithms were implemented with respect to the L1 and L2 distances. The accuracy of the k-nearest neighbors algorithm, obtained for k=12, was 85%, while the centroid algorithm had an accuracy of 67%. The high accuracy of k-nearest neighbors tells us that images belonging to the same class tend to occupy similar places in the frame and have similar pixel intensities. While nearest neighbours obtained good results, they still perform worse than CNNs, as they do not exploit the local neighbourhood of each feature, while centroids fail since they don't distinguish between similar-looking objects (e.g. pullover vs t-shirt/top).
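Both distance-based classifiers can be sketched with NumPy. This is a minimal illustration, not the article's implementation: the function names are hypothetical, and the data is two synthetic 2D clusters standing in for the flattened images.

```python
import numpy as np
from collections import Counter

def distances(A, x, metric="l2"):
    """L1 (Manhattan) or L2 (Euclidean) distance from every row of A to x."""
    diff = A - x
    if metric == "l1":
        return np.abs(diff).sum(axis=1)
    return np.sqrt((diff ** 2).sum(axis=1))

def knn_predict(X_train, y_train, x, k=12, metric="l2"):
    """Majority vote among the k training points closest to x."""
    nearest = np.argsort(distances(X_train, x, metric))[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

def nearest_centroid_predict(X_train, y_train, x, metric="l2"):
    """Assign x to the class whose mean vector is closest."""
    classes = np.unique(y_train)
    centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes[np.argmin(distances(centroids, x, metric))]

# Synthetic stand-in: two clusters of 20 points each.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
                     rng.normal(6.0, 1.0, size=(20, 2))])
y_train = np.array([0] * 20 + [1] * 20)

query = np.array([5.5, 6.2])
knn_label = knn_predict(X_train, y_train, query, k=12, metric="l1")
centroid_label = nearest_centroid_predict(X_train, y_train, query, metric="l2")
print(knn_label, centroid_label)
```

The centroid classifier compresses each class into a single mean vector, which is why it struggles when two classes have similar averages, whereas k-nearest neighbors keeps every training image around at the cost of a full distance scan per query.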

Random Forest

To select the best parameters for estimation, we performed a grid search over the square root (bagging) and the full number of features, the Gini and entropy criteria, and trees with a maximal depth of 5 and 6. The grid search suggested that we should use the square root of the number of features with the entropy criterion (both expected for a classification task). However, the obtained accuracy was only 77%, implying that random forest is not a particularly good method for this task. The reason it failed is that principal components do not represent the rectangular partitions of an image on which random forests operate. The same reasoning applies to the full-size images as well, as the trees would be too deep and lose interpretability.
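The search space described above is small enough to enumerate explicitly. The sketch below lists the candidate settings and defines the two split criteria; the dictionary keys follow the argument names of scikit-learn's `RandomForestClassifier`, which is an assumption about the tooling, and the helper functions are hypothetical.

```python
import math
from itertools import product

# Candidate settings from the text. Key names mirror scikit-learn's
# RandomForestClassifier arguments (an assumption about the tooling).
grid = {
    "max_features": ["sqrt", None],     # square root of, or all, features per split
    "criterion": ["gini", "entropy"],
    "max_depth": [5, 6],
}
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combinations))  # 8

def gini(p):
    """Gini impurity of a class-probability vector."""
    return 1.0 - sum(q * q for q in p)

def entropy(p):
    """Shannon entropy (in bits) of a class-probability vector."""
    return -sum(q * math.log2(q) for q in p if q > 0)

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0
```

Both criteria are zero for a pure node and largest for an even class mix, so each split is chosen to make the child nodes as pure as possible. A grid search simply trains and validates one forest per combination and keeps the best.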

Support Vector Machines (SVM)

We applied SVM using the radial and polynomial kernels. The radial kernel reaches 77% accuracy, while the polynomial kernel fails miserably and is only 46% accurate. Although image classification is not their strength, SVMs are still highly useful for other binary classification tasks. Their biggest caveat is that they require feature selection, which brings accuracy down, and without it they can be computationally expensive. Also, they handle multiclass classification in a one-vs-rest fashion, which makes it harder to create separating hyperplanes efficiently, so they lose value when working with non-binary classification tasks.
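The two kernels compared here differ only in how they measure similarity between a pair of feature vectors. The sketch below writes both out in plain Python; the parameter names (`gamma`, `degree`, `coef0`) and their default values are assumptions chosen for illustration, not the settings used in the article.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Radial (Gaussian) kernel: similarity decays with squared distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel: a power of the shifted, scaled dot product."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))                     # 1.0
print(polynomial_kernel([1.0, 2.0], [3.0, 4.0], degree=2))    # 144.0
```

The radial kernel is bounded between 0 and 1 and depends only on the distance between points, while the polynomial kernel grows with the magnitude of the inputs, which makes it more sensitive to feature scaling.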

Conclusions

In this article, we applied various classification methods to an image classification problem. We have explained why CNNs are the best method among those considered, and why the other methods fail. Some of the reasons why CNNs are the most practical and usually the most accurate method are:

  • They can transfer learning through layers, saving inferences and making new ones on subsequent layers.
  • There is no need for feature extraction before using the algorithm; it is done during training.
  • They recognize important features.

However, they also have their caveats. They are known to fail on images that are rotated and scaled differently, which is not the case here, as the data was pre-processed. And, although the other methods do not give such good results on this dataset, they are still used for other tasks related to image processing (sharpening, smoothing, etc.).

Code: https://github.com/radenjezic153/Stat_ML/blob/master/project.ipynb


Translated from: https://towardsdatascience.com/image-classification-with-fashion-mnist-why-convolutional-neural-networks-outperform-traditional-df531e0533c2