Python機器學習預測蘋果酸甜度
阿新 • • 發佈:2021-06-21
一、選題背景
經常無法判斷哪個蘋果會比較酸或者甜,有的人喜歡甜,有的人喜歡酸,但是都是隻能運用鄉間辦法以及猜測,都屬於並不科學的辦法,所以想要用機器學習來預測蘋果的酸甜度。
二、機器學習案例設計方案
資料集來源:
資料來源於Kaggle,主要為開發商和資料科學家提供舉辦機器學習競賽、託管資料庫、編寫和分享程式碼的平臺。該平臺已經吸引了80萬名資料科學家的關注,這些使用者資源或許正是吸引谷歌的主要因素。
採用的機器學習框架描述:
卷積神經網路(Convolutional Neural Network, CNN):卷積神經網路是深度學習技術中極具代表的網路結構之一,在影象處理領域取得了很大的成功,在國際標準的 ImageNet 資料集上,許多成功的模型都是基於 CNN 的。
Keras:Keras是一個模型級(model-level)的庫,為開發深度學習模型提供了高層次的構建模組。它不處理張量操作、求微積分等基礎的運算,而是依賴--個專門的、高度優化的張量庫來完成這些運算。這個張量庫就是Keras的後端引擎(backendengine),如TensorFlow等。
TensorFlow :TensorFlow 是一個開源機器學習框架,具有快速、靈活並適合產品級大規模應用等特點,讓每個開發者和研究者都能方便地使用人工智慧來解決多樣化的挑戰。
涉及到的技術難點:
容易測試中斷,需要多多嘗試
匯入所需的庫
1 import os 2 import random3 import itertools 4 import numpy as np 5 import pandas as pd 6 import seaborn as sns 7 import tensorflow as tf 8 import matplotlib.pyplot as plt 9 from tensorflow.keras import layers 10 from sklearn import preprocessing 11 from sklearn.decomposition import PCA 12 from sklearn import model_selection as ms13 from sklearn.metrics import confusion_matrix 14 from tensorflow.keras.optimizers import Adamax 15 from tensorflow.keras.initializers import RandomNormal 16 from sklearn.linear_model import LogisticRegression 17 from sklearn.metrics import classification_report 18 from sklearn.model_selection import train_test_split 19 from sklearn.preprocessing import StandardScaler 20 from sklearn import metrics
檢視庫對應的版本
1 pip list
原始資料讀取
定義讀取txt檔案函式,對資料進行讀取
1 def getInfo(samplePath): 2 with open(samplePath, "r") as f: # 開啟檔案 3 datas = f.readlines() 4 5 totalList = [] # 定義空列表 6 for data in datas: 7 list1 = data.split("\t") 8 list2 = [] 9 for i in list1: 10 list2.append(float(i)) 11 totalList.append(list2) # 將讀取後的檔案放入列表 12 return totalList # 返回裝有資料的列表
將處特徵資料的txt檔案讀取後放至空列表中
1 path0 = './res_data/Feature_0.txt' 2 path1 = './res_data/Feature_1.txt' 3 path2 = './res_data/Feature_2.txt' 4 path3 = './res_data/Feature_3.txt' 5 path4 = './res_data/Feature_4.txt' 6 path5 = './res_data/Feature_5.txt' 7 path6 = './res_data/Feature_6.txt' 8 path7 = './res_data/Feature_7.txt' 9 path8 = './res_data/Feature_8.txt' 10 path9= './res_data/Feature_9.txt' 11 path10 = './res_data/Feature_10.txt' 12 path11 = './res_data/Feature_11.txt' 13 path12 = './res_data/Feature_12.txt' 14 path13 = './res_data/Feature_13.txt' 15 path14 = './res_data/Feature_14.txt' 16 path15 = './res_data/Feature_15.txt' 17 path16 = './res_data/Feature_16.txt' 18 19 totalList = [] 20 data0 = getInfo(path0) 21 data1 = getInfo(path1) 22 data2 = getInfo(path2) 23 data3 = getInfo(path3) 24 data4 = getInfo(path4) 25 data5 = getInfo(path5) 26 data6 = getInfo(path6) 27 data7 = getInfo(path7) 28 data8 = getInfo(path8) 29 data9 = getInfo(path9) 30 data10 = getInfo(path10) 31 data11 = getInfo(path11) 32 data12 = getInfo(path12) 33 data13 = getInfo(path13) 34 data14 = getInfo(path14) 35 data15 = getInfo(path15) 36 data16 = getInfo(path16) 37 38 for i in range(len(data1)): 39 totalList.append(data1[i] + data2[i] + data3[i] + data4[i] + data5[i] + data6[i] + data7[i] + data8[i] + data9[i] + data10[i] + data11[i] + data12[i] + data13[i] + data14[i] + data15[i] + data16[i])
讀取標籤資料集(已對資料進行了二分類標籤處理,0-酸蘋果,1-甜蘋果)
1 labelPath = './labels.txt' 2 3 labels = open(labelPath, "r") 4 label = labels.readlines() 5 trueLabelList = [] 6 for r in label: 7 labelList = [] 8 labelList = r.split("\t") 9 trueLabelList.append(int(labelList[len(labelList)-1]))
檢視資料的數量
1 label0 = np.sum(np.array(trueLabelList) == 0) 2 label1 = np.sum(np.array(trueLabelList) == 1) 3 4 print(label0) 5 print(label1)
資料集打亂及劃分
1 totalArr = np.array(totalList, dtype=float) # 將資料轉為陣列格式 2 trueLabelArr = np.array(trueLabelList, dtype=int) 3 4 indx = [i for i in range(len(totalArr))] 5 random.shuffle(indx) 6 totalArr = totalArr[indx] 7 trueLabelArr = trueLabelArr[indx] 8 9 input_features = preprocessing.StandardScaler().fit_transform(totalArr) # 對資料進行標準化 10 11 train_x, test_x, train_y, test_y = ms.train_test_split(input_features, trueLabelArr, test_size=0.2) # 資料集劃分
1 # 搭建網路模型 2 model= tf.keras.Sequential() 3 model.add(layers.Dense(128,kernel_initializer=RandomNormal(mean=0.0,stddev=0.05,seed=None),input_dim=43620)) 4 5 # model.add(layers.Dropout(0.2)) 6 model.add(layers.Dense(256,activation='relu',use_bias=True)) 7 8 # model.add(layers.Dropout(0.3)) 9 model.add(layers.Dense(512,activation='relu',use_bias=True)) 10 11 # model.add(layers.Dropout(0.4)) 12 model.add(layers.Dense(256, activation='relu', use_bias=True)) 13 14 # model.add(layers.Dropout(0.3)) 15 model.add(layers.Dense(128, activation='relu', use_bias=True)) 16 17 # model.add(layers.Dropout(0.2)) 18 19 # model.add(layers.Dense(128,activation='relu',use_bias=True)) 20 21 # model.add(layers.Dropout(0.1)) 22 model.add(layers.Dense(1,activation='sigmoid')) 23 24 model.compile(optimizer=tf.keras.optimizers.SGD(lr=0.0001), 25 loss = 'binary_crossentropy', metrics=['accuracy']) 26 27 model.summary()
模型訓練過程
1 model.compile(optimizer=tf.keras.optimizers.Adam(lr=0.001), 2 loss = 'binary_crossentropy', metrics=['accuracy']) 3 history = model.fit(train_x, train_y,validation_split=0.1, epochs=100,batch_size=32, verbose=1)
1 # 使用predict方法進行預測 2 yPre = model.predict(test_x)
1 # 列印輸出BP神經網路的評估 2 print("BP神經網路準確值:{}".format(metrics.accuracy_score(test_y,yPre.round()))) 3 print("BP神經網路精確度:{}".format(metrics.precision_score(test_y,yPre.round()))) 4 print("BP神經網路召回率:{}".format(metrics.recall_score(test_y,yPre.round()))) 5 print("BP神經網路的F1評分{}".format(metrics.f1_score(test_y,yPre.round())))
打印出其混淆矩陣
1 cMatrix = metrics.confusion_matrix(test_y, yPre.round(), normalize=None) 2 3 pdMatrix = pd.DataFrame(cMatrix, index=["0","1"], columns=["0","1"]) 4 plt.subplots(figsize=(8,8)) 5 sns.heatmap(pdMatrix, annot=True, cbar=None, fmt='g') 6 plt.ylabel("True Class") 7 plt.xlabel("Predicted Class") 8 plt.show()
畫出訓練集和驗證集accuracy以及loss的走向
1 acc = history.history['accuracy'] 2 val_acc = history.history['val_accuracy'] 3 loss = history.history['loss'] 4 val_loss = history.history['val_loss'] 5 6 epochs = range(len(acc)) 7 8 plt.subplots(figsize=(8,8)) 9 plt.plot(epochs, acc, 'b', label='Training accuracy') 10 plt.plot(epochs, val_acc, 'r', label='Validation accuracy') 11 plt.title('Training and validation accuracy') 12 plt.legend() 13 14 # plt.figure() 15 plt.subplots(figsize=(8,8)) 16 plt.plot(epochs, loss, 'b', label='Training Loss') 17 plt.plot(epochs, val_loss, 'r', label='Validation Loss') 18 plt.title('Training and validation loss') 19 plt.legend() 20 21 plt.show()
ROC曲線
1 yPreProb = model.predict_proba(test_x) 2 p1, p2, _ = metrics.roc_curve(test_y, yPreProb) 3 PBAUC = metrics.roc_auc_score(test_y, yPreProb) 4 5 plt.subplots(figsize=(5,5)) 6 plt.plot(p1, p2, color='r', label="Neural Network auc={:.3f}".format(PBAUC)) 7 plt.xlabel("False Positive Rate") 8 plt.ylabel("True Positive Rate") 9 plt.legend(loc=4) 10 plt.show()