Random Forests: Evaluating Car Quality from Car Features

阿新 • Published: 2020-12-19

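Experiment Overview

Evaluating car quality from car features
Let's look at how classification can be applied to a real-world problem. We will use a dataset that records several details about each car, such as the number of doors, the size of the luggage boot, and the maintenance cost, to determine the car's quality. The goal is to assign each car to one of four quality classes: unacceptable, acceptable, good, or very good.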

Preparation

You can download the dataset from https://archive.ics.uci.edu/ml/datasets/Car+Evaluation. Treat every value in the dataset as a string. The dataset has six attributes, with the following value ranges:
 buying: vhigh, high, med, low;
 maint: vhigh, high, med, low;
 doors: 2, 3, 4, 5more (five or more);
 persons: 2, 4, more;
 lug_boot: small, med, big;
 safety: low, med, high.
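
As a rough sketch of what "treat every value as a string" means in practice, the snippet below parses two illustrative rows written in the dataset's comma-separated layout and collects the distinct values seen per column. The column name `class` for the label and the variable names `sample`, `columns`, and `observed` are assumptions made for this example only; the real file has no header row.

```python
import csv
import io

# Two illustrative rows in the dataset's layout: six attributes, then the class label.
sample = io.StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "low,low,5more,more,big,high,vgood\n"
)

# Hypothetical column names for this sketch (the data file itself has no header).
columns = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]

# csv.reader yields every field as a string, which is what we want here:
# values such as "5more" and "more" cannot be parsed as integers anyway.
rows = [row for row in csv.reader(sample) if row]

# Collect the distinct string values observed in each column.
observed = {
    name: sorted({row[i] for row in rows})
    for i, name in enumerate(columns)
}

for name, values in observed.items():
    print(name, values)
```

Running the same loop over the full file is a quick way to confirm that the value ranges listed above match what is actually on disk.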

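Because every row contains string-valued attributes, we treat all features as strings and encode them before setting up the classifier. In the previous chapter we built a regressor with a random forest; here we use a random forest to build a classifier.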
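
To make the string-to-integer step concrete, here is a minimal sketch using scikit-learn's `LabelEncoder` on the four `buying` levels. The variable names are illustrative. Note that the encoder assigns codes in sorted (alphabetical) order of the labels, not in any price order.

```python
from sklearn.preprocessing import LabelEncoder

# The four levels of the "buying" attribute, as strings.
levels = ["vhigh", "high", "med", "low"]

encoder = LabelEncoder()
codes = encoder.fit_transform(levels)

# classes_ holds the labels in sorted order; a label's code is its index there.
print(list(encoder.classes_))   # ['high', 'low', 'med', 'vhigh']
print(list(codes))              # [3, 0, 2, 1]

# inverse_transform maps integer codes back to the original strings,
# which is how a predicted class index can be turned back into a label.
decoded = encoder.inverse_transform(codes)
print(list(decoded))            # ['vhigh', 'high', 'med', 'low']
```

Because the codes follow alphabetical order, they carry no meaningful ranking. Tree-based models split on thresholds and are usually fairly tolerant of this, but it is worth keeping in mind when reading feature values.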

The source code and dataset are available on Baidu Netdisk for anyone who needs them:
Link: https://pan.baidu.com/s/1liNa_Ptggr9zwxCjnldb2w
Extraction code: gr2f

Experiment Steps

# based on 2_9 car.py
# goal: generate validation curves about classifier's performance

import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection
import matplotlib.pyplot as plt


input_file = 'car.data.txt'

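# Read the data: each line is a comma-separated record whose last field is the class label
X = []

with open(input_file, 'r') as f:
    for line in f:
        line = line.strip()
        if not line:  # skip blank lines, e.g. a trailing empty line
            continue
        X.append(line.split(','))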

X = np.array(X)

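# Label-encode every column (features and class label) as integers
label_encoder = []
X_encoded = np.empty(X.shape)
for i, item in enumerate(X[0]):
    label_encoder.append(preprocessing.LabelEncoder())
    X_encoded[:, i] = label_encoder[-1].fit_transform(X[:, i])

# The last column is the class label; the remaining columns are the features
X = X_encoded[:, :-1].astype(int)
y = X_encoded[:, -1].astype(int)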


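# Number of trees in the forest: 200
# Split criterion for each tree: information entropy
# Maximum depth of each tree: 10
# random_state=10 fixes the random seed (bootstrap sampling, feature selection) so runs are reproducible

params = {"n_estimators": 200, "criterion": 'entropy', "max_depth": 10, "random_state": 10}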

classifier = RandomForestClassifier(**params)
classifier.fit(X, y)

# cross validation
accuracy = model_selection.cross_val_score(classifier, X, y,scoring='accuracy', cv=3)
print("Accuracy of the classifier: " + str(round(100*accuracy.mean(), 2)) + "%")

# Test the classifier on a single sample
input_data = ['vhigh', 'vhigh', '2', '2', 'small', 'low']
input_data_encoded = [-1]*len(input_data)
for i, item in enumerate(input_data):
    # transform() returns a one-element array; take its first entry
    input_data_encoded[i] = int(label_encoder[i].transform([item])[0])

# Convert the encoded sample to a NumPy array of shape (1, n_features)
input_data_encoded = np.array(input_data_encoded).reshape(1, len(input_data))

# Print the predicted class
output_class = classifier.predict(input_data_encoded)
print("Output class:", label_encoder[-1].inverse_transform(output_class)[0])



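# Start of 2_10: validation curves


# Fix max_depth at 4 and vary the number of estimators
classifier = RandomForestClassifier(max_depth=4, random_state=7)
parameter_grid = np.linspace(25, 200, 8).astype(int)
# Observe how the number of estimators affects the scores,
# using 8 evenly spaced values from 25 to 200.
# param_name and param_range are passed by keyword, as recent scikit-learn releases require.
train_scores, validation_scores = model_selection.validation_curve(
    classifier, X, y, param_name="n_estimators", param_range=parameter_grid, cv=5)
print("\nParam: n_estimators\nTraining scores:\n", train_scores)
print("\nParam: n_estimators\nValidation scores:\n", validation_scores)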


# Plot the curve
plt.figure()
plt.plot(parameter_grid, 100*np.average(train_scores, axis=1), color='black')
plt.title('Training curve')
plt.xlabel("Number of estimators")
plt.ylabel("Accuracy")
plt.show()

# Validate max_depth in the same way
classifier = RandomForestClassifier(n_estimators=20, random_state=7)
parameter_grid = np.linspace(2, 10, 5).astype(int)
# Store the result in validation_scores so the print below reports this run,
# not the stale scores from the n_estimators experiment
train_scores, validation_scores = model_selection.validation_curve(
    classifier, X, y, param_name="max_depth", param_range=parameter_grid, cv=5)
print("\nParam: max_depth\nTraining scores:\n", train_scores)
print("\nParam: max_depth\nValidation scores:\n", validation_scores)

# Plot the curve
plt.figure()
plt.plot(parameter_grid, 100*np.average(train_scores, axis=1), color='black')
plt.title('Validation curve')
plt.xlabel('Maximum depth of the tree')
plt.ylabel('Accuracy')
plt.show()


########################
# Learning curves


classifier = RandomForestClassifier(random_state=7)

parameter_grid = np.array([200, 500, 800, 1100])
train_sizes, train_scores, validation_scores = model_selection.learning_curve(classifier,
        X, y, train_sizes=parameter_grid, cv=5)
print("\n##### LEARNING CURVES #####")
print("\nTraining scores:\n", train_scores)
print("\nValidation scores:\n", validation_scores)

# Plot the curve
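# Author's note: the code itself is reported to be correct. The author saw this final
# learning-curve plot render incorrectly in PyCharm, while it displayed correctly in Spyder.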
plt.figure()
plt.plot(parameter_grid, 100*np.average(train_scores, axis=1), color='black')
plt.title('Learning curve')
plt.xlabel('Number of training samples')
plt.ylabel('Accuracy')
plt.show()
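
The plots above draw only the mean training score. As a complementary sketch, the snippet below computes the mean training and mean validation score side by side for each `max_depth` value, which makes any gap between the two easier to see. It uses synthetic data from `make_classification` as a stand-in so that it runs without the car data file; `X_demo`, `y_demo`, and `depths` are names invented for this example, and the numbers it prints say nothing about the car dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Synthetic stand-in data (an assumption for illustration, not the car dataset).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4,
    n_classes=3, random_state=0,
)

depths = [2, 4, 6]

# validation_curve returns two arrays of shape (n_param_values, n_cv_folds).
train_scores, validation_scores = validation_curve(
    RandomForestClassifier(n_estimators=20, random_state=7),
    X_demo, y_demo,
    param_name="max_depth", param_range=depths, cv=5,
)

# Average over the folds (axis=1) to get one score per parameter value.
train_mean = train_scores.mean(axis=1)
validation_mean = validation_scores.mean(axis=1)

for depth, tr, va in zip(depths, train_mean, validation_mean):
    print(f"max_depth={depth}: train={tr:.3f}, validation={va:.3f}")
```

To plot both series, one could call `plt.plot(depths, 100*train_mean)` and `plt.plot(depths, 100*validation_mean)` on the same figure before `plt.show()`. A training score that keeps rising while the validation score flattens is a common sign of overfitting.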

Results


