Julia Machine Learning in Practice: Character Image Recognition with Random Forest
阿新 • Published: 2018-11-22
0 Preface
Environment:
- Julia: 1.0
- OS: MacOS
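Training and test data (Baidu Cloud link): click to download, password: u71o
File layout:
- rf_julia_charReg
  - resizeData.py # batch-resizes the images
  - test # test images
  - testResized # resized test images
  - train # training images
  - trainResized # resized training images
  - sampleTest.csv # test data CSV file
  - trainLabels.csv # training label CSV file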
1 Loading the data
Load the packages used below:
using Images
using DataFrames
using Statistics # provides mean()
using DecisionTree # provides build_forest(), apply_forest(), nfoldCV_forest()
using CSV
Note: if a package is not installed yet, install it with:
import Pkg
Pkg.add("PackageName") # e.g. Pkg.add("Images")
Read the image files and return them as a matrix:
function read_data(type_data, labelsInfo, imageSize, path)
    # one row per image, one column per pixel
    x = zeros(size(labelsInfo, 1), imageSize)
    for (index, idImage) in enumerate(labelsInfo.ID)
        nameFile = "$(path)/$(type_data)Resized/$(idImage).Bmp"
        img = load(nameFile)
        temp = float32(img)
        temp = Gray.(temp)
        x[index, :] = reshape(temp, 1, imageSize)
    end
    return x
end
Notes:
float32(): converts the pixel values to 32-bit floating point
Gray.(): converts the RGB image to grayscale
reshape(): flattens the image into a single row of imageSize values
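The flattening step can be tried on its own without any image files. The sketch below is only an illustration: it assumes a 20x20 array of plain Float32 values as a stand-in for a grayscale picture, and the 3-row matrix is a made-up size.

```julia
# A 20x20 array of Float32 values stands in for a grayscale image (assumption).
img = reshape(Float32.(1:400), 20, 20)

# reshape lays the pixels out in column-major order as a 1 x 400 row.
row = reshape(img, 1, 400)

# Pre-allocated feature matrix: one row per image, one column per pixel.
x = zeros(3, 400)
x[2, :] = row    # Float32 values are converted to Float64 on assignment

println(size(row))                # (1, 400)
println(x[2, 21] == img[1, 2])    # true: the second column starts at flattened index 21
```

Because Julia stores arrays column by column, neighbouring entries in the flattened row come from the same image column, not the same image row.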
Set the image size and the project path:
imageSize = 400
path = "..."
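To see which file name read_data builds from these settings, the string interpolation can be run in isolation. This is a small sketch; the folder name "data" and the image ID 1 are hypothetical values chosen for illustration.

```julia
path = "data"          # hypothetical project folder, not the author's path
type_data = "train"    # "train" or "test", as passed to read_data
idImage = 1            # hypothetical image ID taken from the ID column

# Same interpolation as in read_data: <path>/<type>Resized/<id>.Bmp
nameFile = "$(path)/$(type_data)Resized/$(idImage).Bmp"
println(nameFile)      # data/trainResized/1.Bmp
```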
Read the training labels:
labelsInfoTrain = CSV.read("$(path)/trainLabels.csv")
Read the training images:
xTrain = read_data("train", labelsInfoTrain, imageSize, path)
Read the test labels:
labelsInfoTest = CSV.read("$(path)/sampleSubmission.csv")
Read the test images:
xTest = read_data("test", labelsInfoTest, imageSize, path)
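2 Training the random forest (train RF)
Take the first character of each label and convert it to its integer code (these two lines also appear in the complete code in section 3):
yTrain = map(x -> x[1], labelsInfoTrain.Class)
yTrain = Int.(yTrain)
Train: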
model = build_forest(yTrain, xTrain, 20, 50, 1.0)
Notes:
3rd argument (20): number of features chosen at each random split
4th argument (50): number of trees
5th argument (1.0): ratio of subsampling
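To make the first and last of these parameters more concrete, here is a rough sketch of the two kinds of random draws they control. The helpers sample_rows and sample_features are hypothetical and are not part of DecisionTree.jl; sampling rows with replacement and the sample count of 1000 are assumptions made for illustration.

```julia
using Random

# Hypothetical helper: draw the rows that one tree is trained on.
# Sampling with replacement is an assumption here.
function sample_rows(rng, n_samples, ratio)
    n_draw = round(Int, ratio * n_samples)
    return rand(rng, 1:n_samples, n_draw)
end

# Hypothetical helper: pick the candidate feature columns for one split,
# without replacement.
function sample_features(rng, n_features, n_subfeatures)
    return randperm(rng, n_features)[1:n_subfeatures]
end

rng = MersenneTwister(42)
rows = sample_rows(rng, 1000, 1.0)       # ratio 1.0: as many draws as there are samples
feats = sample_features(rng, 400, 20)    # 20 of the 400 pixel columns

println(length(rows))            # 1000
println(length(unique(feats)))   # 20
```

Each of the 50 trees would repeat draws like these independently, which is what makes the trees in the forest differ from one another.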
Get the predictions for the test set:
predTest = apply_forest(model, xTest)
Convert the predictions back to characters:
labelsInfoTest.Class = Char.(predTest)
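The conversion works because Int and Char are inverses of each other on code points. The round trip can be checked with a few made-up labels (the sample values below are not taken from the data set):

```julia
# Labels as they might appear in the Class column (made-up sample values).
classes = ["a", "B", "7"]

# First character of each string, then its Unicode code point.
codes = Int.(map(s -> s[1], classes))
println(codes)    # [97, 66, 55]

# Char.() inverts the encoding, so integer predictions map back to characters.
chars = Char.(codes)
println(chars)    # ['a', 'B', '7']
println(string.(chars) == classes)    # true
```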
Write the result to a file:
CSV.write("$(path)/predTest.csv", labelsInfoTest, header=true)
4-fold cross-validation:
accuracy = nfoldCV_forest(yTrain, xTrain, 20, 50, 4, 1.0);
println("4 fold accuracy: $(mean(accuracy))")
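For readers unfamiliar with cross-validation, the sketch below shows one way to split sample indices into 4 folds, each held out once for testing. It is only an illustration of the idea: kfold_indices is a hypothetical helper, and the way DecisionTree.jl partitions the data internally may differ.

```julia
using Random

# Hypothetical helper: split the shuffled indices 1:n into k nearly equal folds.
function kfold_indices(rng, n, k)
    idx = shuffle(rng, 1:n)
    return [idx[i:k:n] for i in 1:k]
end

folds = kfold_indices(MersenneTwister(1), 10, 4)

for (i, test_idx) in enumerate(folds)
    # Everything outside the held-out fold is used for training.
    train_idx = setdiff(1:10, test_idx)
    println("fold $i: test on $(length(test_idx)) samples, train on $(length(train_idx))")
end
```

With 10 samples and 4 folds the held-out sets have 3, 3, 2 and 2 samples, and every sample is tested exactly once. The accuracy reported above is the mean over the folds.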
3 Complete code
using Images
using DataFrames
using Statistics
using CSV
using DecisionTree
function read_data(type_data, labelsInfo, imageSize, path)
    x = zeros(size(labelsInfo, 1), imageSize)
    for (index, idImage) in enumerate(labelsInfo.ID)
        nameFile = "$(path)/$(type_data)Resized/$(idImage).Bmp"
        img = load(nameFile)
        temp = float32(img)
        temp = Gray.(temp)
        x[index, :] = reshape(temp, 1, imageSize)
    end
    return x
end
imageSize = 400
path = "/Users/congying/cyWang/projects/julia/kaggleFirstStepsWithJulia/all"
labelsInfoTrain = CSV.read("$(path)/trainLabels.csv")
xTrain = read_data("train", labelsInfoTrain, imageSize, path)
labelsInfoTest = CSV.read("$(path)/sampleSubmission.csv")
xTest = read_data("test", labelsInfoTest, imageSize, path)
yTrain = map(x -> x[1], labelsInfoTrain.Class)
yTrain = Int.(yTrain)
model = build_forest(yTrain, xTrain, 20, 50, 1.0)
predTest = apply_forest(model, xTest)
labelsInfoTest.Class = Char.(predTest)
CSV.write("$(path)/juliaSubmission.csv", labelsInfoTest, header=true)
accuracy = nfoldCV_forest(yTrain, xTrain, 20, 50, 4, 1.0);
println("4 fold accuracy: $(mean(accuracy))")