1.3.1 Julia Machine Learning in Practice: Character Image Recognition with Random Forest

0 Preface

Environment:

 - Julia: 1.0
 - OS: macOS

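Training and test data: Baidu Cloud link (click to download), password: u71o. The archive contains the following files:

 - rf_julia_charReg
     - resizeData.py    # batch-resizes the images
     - test    # test images
     - testResized    # resized test images
     - train    # training images
     - trainResized    # resized training images
     - sampleTest.csv    # test data CSV file
     - trainLabels.csv     # training label CSV file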

1 Loading the data

Load the packages that are needed:

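using Images
using DataFrames
using Statistics  # provides mean(), used for the cross-validation accuracy
using CSV
using DecisionTree  # provides build_forest, apply_forest, nfoldCV_forest

Note: if a package is not installed yet, install it first:

import Pkg
Pkg.add("PKG_NAME")  # for example: Pkg.add("Images")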

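Read the image files and return them as a matrix, one image per row:

function read_data(type_data, labelsInfo, imageSize, path)
    # One row per image, one column per pixel
    x = zeros(size(labelsInfo, 1), imageSize)
    for (index, idImage) in enumerate(labelsInfo.ID)
        nameFile = "$(path)/$(type_data)Resized/$(idImage).Bmp"
        img = load(nameFile)
        temp = float32(img)    # convert pixel values to Float32
        temp = Gray.(temp)     # convert each pixel to grayscale
        x[index, :] = reshape(temp, 1, imageSize)    # flatten into one row
    end
    return x
end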

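Explanation:

float32(): converts the pixel values to 32-bit floating-point numbers
Gray.(): converts the RGB image to grayscale, pixel by pixel
reshape(): flattens the two-dimensional image into a single row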
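The flattening step can be tried without any image files. The following is a minimal sketch that uses a plain 2x3 matrix as a stand-in for a grayscale image; the values and the names `img`, `row`, and `x` are made up for illustration. It shows that `reshape` lays the pixels out column by column, which is the order in which each image ends up in a row of the feature matrix.

```julia
# Stand-in for a tiny 2x3 grayscale image (hypothetical values).
img = [0.1 0.2 0.3;
       0.4 0.5 0.6]

# Flatten to a single 1x6 row; Julia walks the matrix column by column.
row = reshape(img, 1, length(img))

# Store the flattened image as the first row of a feature matrix,
# mirroring what read_data does for each file.
x = zeros(2, length(img))
x[1, :] = vec(row)

println(row)
println(x[1, :])
```

Because every image is flattened in the same order, a given column of the feature matrix always refers to the same pixel position.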

Set the image size and the project path:

imageSize = 400    # number of pixels in each resized image
path = "..."

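Read the training labels (recent CSV.jl releases require the sink to be passed explicitly, as in CSV.read(file, DataFrame)):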

labelsInfoTrain = CSV.read("$(path)/trainLabels.csv")

Read the training image data:

xTrain = read_data("train", labelsInfoTrain, imageSize, path)

Read the test labels:

labelsInfoTest = CSV.read("$(path)/sampleSubmission.csv")

Read the test image data:

xTest = read_data("test", labelsInfoTest, imageSize, path)


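2 Training the random forest (train RF)

Extract the training labels by taking the first character of each label and converting it to an integer:

yTrain = map(x -> x[1], labelsInfoTrain.Class)
yTrain = Int.(yTrain)

Train the model:

model = build_forest(yTrain, xTrain, 20, 50, 1.0)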

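Explanation of the arguments:

3rd argument (20): number of features chosen at each random split
4th argument (50): number of trees
5th argument (1.0): ratio of subsampling (the fraction of the samples used for each tree)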
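To make the three parameters concrete, here is a rough sketch of the kind of random draws a random forest makes, written with only the `Random` standard library. It illustrates the idea under simplifying assumptions and is not DecisionTree.jl's actual implementation; `n_samples` and the seed are made-up values.

```julia
using Random

rng = MersenneTwister(1)   # fixed seed so the sketch is repeatable

n_samples        = 100     # hypothetical number of training images
n_features       = 400     # one feature per pixel (imageSize)
n_subfeatures    = 20      # features considered at each split
n_trees          = 50      # trees in the forest
partial_sampling = 1.0     # fraction of the samples drawn for each tree

# Rows drawn (with replacement) to train each tree.
n_draw = round(Int, partial_sampling * n_samples)
rows_per_tree = [rand(rng, 1:n_samples, n_draw) for _ in 1:n_trees]

# Candidate features examined at one split: a random subset without repeats.
split_features = randperm(rng, n_features)[1:n_subfeatures]

println("trees: ", length(rows_per_tree))
println("rows per tree: ", length(rows_per_tree[1]))
println("features tried at a split: ", length(split_features))
```

Looking at only 20 of the 400 pixels per split keeps the individual trees different from one another, which is what lets the forest's combined vote generalize better than a single tree.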

Predict on the test set:

predTest = apply_forest(model, xTest)

Convert the predictions back to characters:

labelsInfoTest.Class = Char.(predTest)
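The labels make a round trip: they become integer code points before training and are turned back into characters after prediction. A small sketch with made-up labels (the `classes` vector is hypothetical), using only Base functions:

```julia
# Hypothetical labels as they might appear in the Class column.
classes = ["n", "8", "T"]

# First character of each string, then its integer code point.
chars = map(s -> s[1], classes)
codes = Int.(chars)

# After prediction, integer code points are mapped back to characters.
restored = Char.(codes)

println(codes)
println(restored)
```

Since `Char.(Int.(chars))` returns the original characters, no information is lost by training on integers.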

Write the results to a file:

CSV.write("$(path)/predTest.csv", labelsInfoTest, header=true)

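4-fold cross-validation (the position of the fold-count argument has differed between DecisionTree.jl releases, so check the signature for your installed version):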

accuracy = nfoldCV_forest(yTrain, xTrain, 20, 50, 4, 1.0);
println("4 fold accuracy: $(mean(accuracy))")
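For readers unfamiliar with n-fold cross-validation, the sketch below shows one way the sample indices could be split into folds. `kfold_indices` is a hypothetical helper written for this illustration and is not part of DecisionTree.jl, whose internal splitting may differ.

```julia
using Random

# Split the indices 1:n into k folds of nearly equal size after shuffling.
function kfold_indices(n::Int, k::Int; rng = MersenneTwister(0))
    shuffled = randperm(rng, n)
    # Fold f takes every k-th element starting at position f.
    return [shuffled[f:k:n] for f in 1:k]
end

folds = kfold_indices(10, 4)
for (f, test_idx) in enumerate(folds)
    train_idx = setdiff(1:10, test_idx)   # everything not held out
    println("fold $f: test=$(sort(test_idx)) train size=$(length(train_idx))")
end
```

Each sample is held out exactly once, so averaging the per-fold accuracies gives an estimate based on every training image.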

3 Complete code

using Images
using DataFrames
using Statistics
using CSV
using DecisionTree

function read_data(type_data, labelsInfo, imageSize, path)
    x = zeros(size(labelsInfo, 1), imageSize)
    for (index, idImage) in enumerate(labelsInfo.ID)
        nameFile = "$(path)/$(type_data)Resized/$(idImage).Bmp"
        img = load(nameFile)
        temp = float32(img)
        temp = Gray.(temp)
        x[index, :] = reshape(temp, 1, imageSize)
    end
    return x
end


imageSize = 400
path = "/Users/congying/cyWang/projects/julia/kaggleFirstStepsWithJulia/all"
labelsInfoTrain = CSV.read("$(path)/trainLabels.csv")
xTrain = read_data("train", labelsInfoTrain, imageSize, path)
labelsInfoTest = CSV.read("$(path)/sampleSubmission.csv")
xTest = read_data("test", labelsInfoTest, imageSize, path)
yTrain = map(x -> x[1], labelsInfoTrain.Class)
yTrain = Int.(yTrain)


model = build_forest(yTrain, xTrain, 20, 50, 1.0)
predTest = apply_forest(model, xTest)
labelsInfoTest.Class = Char.(predTest)
CSV.write("$(path)/juliaSubmission.csv", labelsInfoTest, header=true)
accuracy = nfoldCV_forest(yTrain, xTrain, 20, 50, 4, 1.0);
println("4 fold accuracy: $(mean(accuracy))")