Tuning the Parameters of Multiclass News Classification Models with Cross-Validation
In the previous post (https://blog.csdn.net/baymax_007/article/details/82775180), we simply applied logistic regression, decision trees, random forests, a multilayer perceptron classifier, XGBoost, and naive Bayes to classify news articles. In practice, however, a model's performance depends on its initialization parameters, and choosing suitable values can improve classification accuracy.
Cross-validation is a common method for tuning model parameters. CrossValidator splits the dataset into several folds that serve in turn as training and test sets. For example, with k=3, CrossValidator generates 3 (training, test) pairs, each using 2/3 of the data for training and 1/3 for testing. For a given parameter map, CrossValidator trains a model on each of the three training sets and averages the evaluation metric over the corresponding test sets. Once the best parameter map is identified, CrossValidator refits the estimator on the full dataset using that parameter map.
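The fold-splitting and metric-averaging that CrossValidator performs can be sketched in a few lines of plain Scala. This is a conceptual illustration only, not Spark's implementation; `evaluate` is a hypothetical stand-in for "fit on the training fold, score on the test fold":

```scala
// Minimal sketch of k-fold cross-validation over sample indices 0 until nSamples.
// `evaluate` is a placeholder for training on trainIdx and scoring on testIdx.
def crossValidate(nSamples: Int, k: Int)(evaluate: (Seq[Int], Seq[Int]) => Double): Double = {
  // Partition the indices into k disjoint folds.
  val folds = (0 until nSamples).groupBy(_ % k).values.toSeq
  val scores = folds.map { testIdx =>
    val trainIdx = (0 until nSamples).filterNot(testIdx.contains)
    evaluate(trainIdx, testIdx)
  }
  scores.sum / k // average metric over the k folds
}
```

With k=3 and 9 samples, each call to `evaluate` sees 6 training and 3 test indices, matching the 2/3 vs 1/3 split described above.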
Spark ML provides ParamGridBuilder to construct the parameter search grid; a CrossValidator instance is then configured with that grid and the Pipeline to be tuned.
I. Experiment Design
1. Feature-extraction parameters
CountVectorizer converts the segmented keywords into numeric term-frequency features. Its two key parameters are vocabSize, the maximum size of the vocabulary, and minDF, the minimum number of distinct documents in which a term must appear to be included in the vocabulary.
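The interplay of the two parameters can be sketched in plain Scala. This is a conceptual model of vocabulary selection (terms filtered by document frequency, then capped by total frequency), not CountVectorizer's actual implementation:

```scala
// Sketch: build a vocabulary from tokenized documents, keeping only terms that
// appear in at least minDF documents, capped at vocabSize terms ordered by
// total frequency (most frequent first, ties broken alphabetically).
def buildVocab(docs: Seq[Seq[String]], vocabSize: Int, minDF: Int): Seq[String] = {
  val df = docs.flatMap(_.distinct).groupBy(identity).map { case (t, xs) => t -> xs.size } // document frequency
  val tf = docs.flatten.groupBy(identity).map { case (t, xs) => t -> xs.size }             // total term frequency
  df.filter { case (_, n) => n >= minDF }.keys.toSeq
    .sortBy(t => (-tf(t), t))
    .take(vocabSize)
}
```

Raising minDF drops rare (often noisy) terms; lowering vocabSize keeps only the most frequent ones, shrinking the feature vector.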
2. Classification model parameters
2.1 Logistic regression
regParam: regularization coefficient, guards against overfitting
elasticNetParam: elastic-net mixing parameter, the L1:L2 ratio
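In Spark ML these two parameters combine into the elastic-net penalty regParam × (elasticNetParam × ‖w‖₁ + (1 − elasticNetParam)/2 × ‖w‖₂²), so elasticNetParam = 1.0 is pure L1 and 0.0 is pure L2. A plain-Scala sketch of that term:

```scala
// Elastic-net regularization term as used by Spark ML's LogisticRegression.
// regParam scales the penalty; elasticNetParam mixes L1 (=1.0) and L2 (=0.0).
def elasticNetPenalty(w: Seq[Double], regParam: Double, elasticNetParam: Double): Double = {
  val l1 = w.map(math.abs).sum        // ||w||_1
  val l2 = w.map(x => x * x).sum      // ||w||_2^2
  regParam * (elasticNetParam * l1 + (1 - elasticNetParam) / 2 * l2)
}
```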
2.2 Decision tree
impurity: impurity measure; options are entropy and gini
maxDepth: maximum depth of the tree
maxBins: maximum number of bins when discretizing continuous features
minInfoGain: minimum information gain required for a node split
minInstancesPerNode: minimum number of instances each child node must contain
2.3 Random forest
impurity: impurity measure; options are entropy and gini
maxDepth: maximum depth of each tree
maxBins: maximum number of bins when discretizing continuous features
minInfoGain: minimum information gain required for a node split
minInstancesPerNode: minimum number of instances each child node must contain
numTrees: number of trees to train (with bootstrap sampling)
subsamplingRate: fraction of the training data sampled for training each tree
2.4 XGBoost
eta: learning rate, default 0.3
maxDepth: maximum tree depth
numRound: number of boosting rounds
alpha: L1 regularization coefficient
2.5 Naive Bayes
smoothing: additive smoothing parameter
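The smoothing parameter is the additive (Laplace/Lidstone) smoothing constant applied to the per-class term counts, preventing zero probabilities for unseen terms. A plain-Scala sketch of the smoothed conditional probability it controls:

```scala
// Additive smoothing for naive Bayes: P(term | class) with smoothing alpha.
// count = occurrences of the term in the class, total = all term occurrences
// in the class, vocabSize = number of distinct terms in the vocabulary.
def smoothedProb(count: Long, total: Long, vocabSize: Int, alpha: Double): Double =
  (count + alpha) / (total + alpha * vocabSize)
```

With alpha = 1.0 an unseen term (count = 0) still gets a small nonzero probability instead of zeroing out the whole class posterior.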
3. Cross-validation
The ml package provides a cross-validation API for parameter search: ParamGridBuilder's addGrid sets up the parameter grid over the Pipeline model, and CrossValidator searches that grid during training for the optimal combination.
4. Overall design
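The code in section II reuses pipeline stages built in the previous post — indexer, segmenter, remover, vectorizer, converts — together with a train/test split. Roughly, that shared setup looks like the following sketch; the column names ("category", "content") and the RegexTokenizer stand-in for the custom Chinese word segmenter are assumptions, not the original code:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{CountVectorizer, IndexToString, RegexTokenizer, StopWordsRemover, StringIndexer}

// Shared stages assumed by every experiment below (names match the later code).
val indexer = new StringIndexer().setInputCol("category").setOutputCol("label")
// Stand-in for the custom Chinese segmenter used in the previous post:
val segmenter = new RegexTokenizer().setInputCol("content").setOutputCol("tokens")
val remover = new StopWordsRemover().setInputCol("tokens").setOutputCol("removed")
val vectorizer = new CountVectorizer().setInputCol("removed").setOutputCol("features")
// Maps predicted label indices back to the original category strings:
val converts = new IndexToString().setInputCol("prediction").setOutputCol("predictedCategory")
// Logistic regression estimator referenced as `lr` in section II.1:
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
// val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 123456L)
```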
II. Implementation
1. Logistic regression parameter search
1.1 Code
val lrStartTime = new Date().getTime
val lrPipeline = new Pipeline()
.setStages(Array(indexer,segmenter,remover,vectorizer,lr,converts))
// Parameter grid and cross-validation setup
val lrParamGrid = new ParamGridBuilder()
.addGrid(vectorizer.vocabSize,Array(1<<10,1<<18))
.addGrid(vectorizer.minDF,Array(1.0,2.0))
.addGrid(lr.regParam,Array(0.1,0.01))
.addGrid(lr.elasticNetParam,Array(0.1,0.0))
.build()
val lrCv = new CrossValidator()
.setEstimator(lrPipeline)
.setEvaluator(new MulticlassClassificationEvaluator)
.setEstimatorParamMaps(lrParamGrid)
.setNumFolds(2)
.setParallelism(4)
val lrModel = lrCv.fit(train)
val lrValiad = lrModel.transform(train)
val lrPredictions = lrModel.transform(test)
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracyLrt = evaluator.evaluate(lrValiad)
println(s"Logistic regression validation-set accuracy = $accuracyLrt")
val accuracyLrv = evaluator.evaluate(lrPredictions)
println(s"Logistic regression test-set accuracy = $accuracyLrv")
val lrEndTime = new Date().getTime
val lrCostTime = (lrEndTime - lrStartTime)/lrParamGrid.length
println(s"Logistic regression classification time (ms, per parameter set): $lrCostTime")
// Retrieve the best model
val bestLrModel = lrModel.bestModel.asInstanceOf[PipelineModel]
val bestLrVectorizer = bestLrModel.stages(3).asInstanceOf[CountVectorizerModel]
val blvv = bestLrVectorizer.getVocabSize
val blvm = bestLrVectorizer.getMinDF
val bestLr = bestLrModel.stages(4).asInstanceOf[LogisticRegressionModel]
val blr = bestLr.getRegParam
val ble = bestLr.getElasticNetParam
println(s"Best CountVectorizer parameters:\nvocabSize = $blvv, minDF = $blvm\nBest logistic regression parameters:\nregParam = $blr, elasticNetParam = $ble")
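Besides bestModel, the fitted CrossValidatorModel also exposes avgMetrics, the evaluation metric averaged over folds for each parameter map (in the same order as the grid), which shows how close the runner-up combinations were. A short sketch continuing from the fitted `lrModel` and `lrParamGrid` above:

```scala
// Average cross-validation metric for each parameter combination, best first.
lrModel.avgMetrics.zip(lrParamGrid)
  .sortBy { case (metric, _) => -metric }
  .foreach { case (metric, params) => println(f"$metric%.4f  $params") }
```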
1.2 Search results
Logistic regression validation-set accuracy = 1.0
Logistic regression test-set accuracy = 0.8187134502923976
Logistic regression classification time (ms, per parameter set): 17404
Best CountVectorizer parameters:
vocabSize = 262144, minDF = 2.0
Best logistic regression parameters:
regParam = 0.01, elasticNetParam = 0.1
2. Decision tree parameter search
2.1 Code
val dtStartTime = new Date().getTime
val dt = new DecisionTreeClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
.setSeed(123456L)
val dtPipeline = new Pipeline()
.setStages(Array(indexer,segmenter,remover,vectorizer,dt,converts))
// Parameter grid and cross-validation setup
val dtParamGrid = new ParamGridBuilder()
.addGrid(vectorizer.vocabSize,Array(1<<10,1<<18))
.addGrid(vectorizer.minDF,Array(1.0,2.0))
.addGrid(dt.impurity,Array("entropy","gini"))
.addGrid(dt.maxDepth,Array(5,10))
.addGrid(dt.maxBins,Array(32,500))
.addGrid(dt.minInfoGain,Array(0.1,0.01))
.addGrid(dt.minInstancesPerNode,Array(5,10))
.build()
val dtCv = new CrossValidator()
.setEstimator(dtPipeline)
.setEvaluator(new MulticlassClassificationEvaluator)
.setEstimatorParamMaps(dtParamGrid)
.setNumFolds(2)
.setParallelism(7)
val dtModel = dtCv.fit(train)
val dtValiad = dtModel.transform(train)
val dtPredictions = dtModel.transform(test)
val accuracyDtt = evaluator.evaluate(dtValiad)
println(s"Decision tree validation-set accuracy = $accuracyDtt")
val accuracyDtv = evaluator.evaluate(dtPredictions)
println(s"Decision tree test-set accuracy = $accuracyDtv")
val dtEndTime = new Date().getTime
val dtCostTime = (dtEndTime - dtStartTime)/dtParamGrid.length
println(s"Decision tree classification time (ms, per parameter set): $dtCostTime")
// Retrieve the best model
val bestDtModel = dtModel.bestModel.asInstanceOf[PipelineModel]
val bestDtVectorizer = bestDtModel.stages(3).asInstanceOf[CountVectorizerModel]
val bdvv = bestDtVectorizer.getVocabSize
val bdvm = bestDtVectorizer.getMinDF
val bestDt = bestDtModel.stages(4).asInstanceOf[DecisionTreeClassificationModel]
val bdi = bestDt.getImpurity
val bdmd = bestDt.getMaxDepth
val bdmb = bestDt.getMaxBins
val bdmig = bestDt.getMinInfoGain
val bdmipn = bestDt.getMinInstancesPerNode
println(s"Best CountVectorizer parameters:\nvocabSize = $bdvv, minDF = $bdvm\nBest decision tree parameters:\nimpurity = $bdi, maxDepth = $bdmd, maxBins = $bdmb, minInfoGain = $bdmig, minInstancesPerNode = $bdmipn")
2.2 Search results
Decision tree validation-set accuracy = 0.823021582733813
Decision tree test-set accuracy = 0.7894736842105263
Decision tree classification time (ms, per parameter set): 17698
Best CountVectorizer parameters:
vocabSize = 262144, minDF = 1.0
Best decision tree parameters:
impurity = entropy, maxDepth = 10, maxBins = 32, minInfoGain = 0.1, minInstancesPerNode = 5
3. Random forest parameter search
3.1 Code
val rfStartTime = new Date().getTime
val rf = new RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
.setSeed(123456L)
val rfPipeline = new Pipeline()
.setStages(Array(indexer,segmenter,remover,vectorizer,rf,converts))
// Parameter grid and cross-validation setup
val rfParamGrid = new ParamGridBuilder()
.addGrid(vectorizer.vocabSize,Array(1<<10,1<<18))
.addGrid(vectorizer.minDF,Array(1.0,2.0))
.addGrid(rf.impurity,Array("entropy","gini"))
.addGrid(rf.maxDepth,Array(5,10))
.addGrid(rf.maxBins,Array(32,500))
.addGrid(rf.minInfoGain,Array(0.1,0.01))
.addGrid(rf.minInstancesPerNode,Array(5,10))
.addGrid(rf.numTrees,Array(20,50))
.addGrid(rf.subsamplingRate,Array(0.2,0.1))
.build()
val rfCv = new CrossValidator()
.setEstimator(rfPipeline)
.setEvaluator(new MulticlassClassificationEvaluator)
.setEstimatorParamMaps(rfParamGrid)
.setNumFolds(2)
.setParallelism(9)
val rfModel = rfCv.fit(train)
val rfValiad = rfModel.transform(train)
val rfPredictions = rfModel.transform(test)
val accuracyRft = evaluator.evaluate(rfValiad)
println(s"Random forest validation-set accuracy = $accuracyRft")
val accuracyRfv = evaluator.evaluate(rfPredictions)
println(s"Random forest test-set accuracy = $accuracyRfv")
val rfEndTime = new Date().getTime
val rfCostTime = (rfEndTime - rfStartTime)/rfParamGrid.length
println(s"Random forest classification time (ms, per parameter set): $rfCostTime")
// Retrieve the best model
val bestRfModel = rfModel.bestModel.asInstanceOf[PipelineModel]
val bestRfVectorizer = bestRfModel.stages(3).asInstanceOf[CountVectorizerModel]
val brvv = bestRfVectorizer.getVocabSize
val brvm = bestRfVectorizer.getMinDF
val bestRf = bestRfModel.stages(4).asInstanceOf[RandomForestClassificationModel]
val bri = bestRf.getImpurity
val brmd = bestRf.getMaxDepth
val brmb = bestRf.getMaxBins
val brmig = bestRf.getMinInfoGain
val brmipn = bestRf.getMinInstancesPerNode
val brnt = bestRf.getNumTrees
val brsr = bestRf.getSubsamplingRate
println(s"Best CountVectorizer parameters:\nvocabSize = $brvv, minDF = $brvm\nBest random forest parameters:\nimpurity = $bri, maxDepth = $brmd, maxBins = $brmb, minInfoGain = $brmig, minInstancesPerNode = $brmipn, numTrees = $brnt, subsamplingRate = $brsr")
3.2 Search results
Random forest validation-set accuracy = 0.9510791366906475
Random forest test-set accuracy = 0.8140350877192983
Random forest classification time (ms, per parameter set): 15715
Best CountVectorizer parameters:
vocabSize = 1024, minDF = 2.0
Best random forest parameters:
impurity = gini, maxDepth = 10, maxBins = 32, minInfoGain = 0.01, minInstancesPerNode = 5, numTrees = 50, subsamplingRate = 0.2
4. XGBoost parameter search
4.1 Code
val xgbStartTime = new Date().getTime
val xgb = new XGBoostClassifier()
.setObjective("multi:softprob")
.setNumClass(outputLayers)
.setFeaturesCol("features")
.setLabelCol("label")
.setNumWorkers(1)
val xgbPipeline = new Pipeline()
.setStages(Array(indexer,segmenter,remover,vectorizer,xgb,converts))
// Parameter grid and cross-validation setup
val xgbParamGrid = new ParamGridBuilder()
.addGrid(vectorizer.vocabSize,Array(1<<10,1<<18))
.addGrid(vectorizer.minDF,Array(1.0,2.0))
.addGrid(xgb.eta,Array(0.3,0.1))
.addGrid(xgb.maxDepth,Array(6,10))
.addGrid(xgb.numRound,Array(10,100))
.addGrid(xgb.alpha,Array(0.1,0.0))
.build()
val xgbCv = new CrossValidator()
.setEstimator(xgbPipeline)
.setEvaluator(new MulticlassClassificationEvaluator)
.setEstimatorParamMaps(xgbParamGrid)
.setNumFolds(2)
.setParallelism(6)
val xgbModel = xgbCv.fit(train)
val xgbValiad = xgbModel.transform(train)
val xgbPredictions = xgbModel.transform(test)
val accuracyXgbt = evaluator.evaluate(xgbValiad)
println(s"XGBoost validation-set accuracy = $accuracyXgbt")
val accuracyXgbv = evaluator.evaluate(xgbPredictions)
println(s"XGBoost test-set accuracy = $accuracyXgbv")
val xgbEndTime = new Date().getTime
val xgbCostTime = (xgbEndTime - xgbStartTime)/xgbParamGrid.length
println(s"XGBoost classification time (ms, per parameter set): $xgbCostTime")
// Retrieve the best model
val bestXgbModel = xgbModel.bestModel.asInstanceOf[PipelineModel]
val bestXgbVectorizer = bestXgbModel.stages(3).asInstanceOf[CountVectorizerModel]
val bxvv = bestXgbVectorizer.getVocabSize
val bxvm = bestXgbVectorizer.getMinDF
val bestXgb = bestXgbModel.stages(4).asInstanceOf[XGBoostClassificationModel]
val bxe = bestXgb.getEta
val bxmd = bestXgb.getMaxDepth
val bxnr = bestXgb.getNumRound
val bxa = bestXgb.getAlpha
println(s"Best CountVectorizer parameters:\nvocabSize = $bxvv, minDF = $bxvm\nBest XGBoost parameters:\neta = $bxe, maxDepth = $bxmd, numRound = $bxnr, alpha = $bxa")
4.2 Search results
XGBoost validation-set accuracy = 1.0
XGBoost test-set accuracy = 0.8654970760233918
XGBoost classification time (ms, per parameter set): 32023
Best CountVectorizer parameters:
vocabSize = 262144, minDF = 2.0
Best XGBoost parameters:
eta = 0.1, maxDepth = 6, numRound = 100, alpha = 0.0
5. Naive Bayes parameter search
5.1 Code
val nvbStartTime = new Date().getTime
val nvb = new NaiveBayes()
val nvbPipeline = new Pipeline()
.setStages(Array(indexer,segmenter,remover,vectorizer,nvb,converts))
// Parameter grid and cross-validation setup
val nvbParamGrid = new ParamGridBuilder()
.addGrid(vectorizer.vocabSize,Array(1<<10,1<<18))
.addGrid(vectorizer.minDF,Array(1.0,2.0))
.addGrid(nvb.smoothing,Array(1, 0.5))
.build()
val nvbCv = new CrossValidator()
.setEstimator(nvbPipeline)
.setEvaluator(new MulticlassClassificationEvaluator)
.setEstimatorParamMaps(nvbParamGrid)
.setNumFolds(2)
.setParallelism(3)
val nvbModel = nvbCv.fit(train)
val nvbValiad = nvbModel.transform(train)
val nvbPredictions = nvbModel.transform(test)
val accuracyNvbt = evaluator.evaluate(nvbValiad)
println(s"Naive Bayes validation-set accuracy = $accuracyNvbt")
val accuracyNvbv = evaluator.evaluate(nvbPredictions)
println(s"Naive Bayes test-set accuracy = $accuracyNvbv")
val nvbEndTime = new Date().getTime
val nvbCostTime = (nvbEndTime - nvbStartTime)/nvbParamGrid.length
println(s"Naive Bayes classification time (ms, per parameter set): $nvbCostTime")
// Retrieve the best model
val bestNvbModel = nvbModel.bestModel.asInstanceOf[PipelineModel]
val bestNvbVectorizer = bestNvbModel.stages(3).asInstanceOf[CountVectorizerModel]
val bnvv = bestNvbVectorizer.getVocabSize
val bnvm = bestNvbVectorizer.getMinDF
val bestNvb = bestNvbModel.stages(4).asInstanceOf[NaiveBayesModel]
val bns = bestNvb.getSmoothing
println(s"Best CountVectorizer parameters:\nvocabSize = $bnvv, minDF = $bnvm\nBest naive Bayes parameters:\nsmoothing = $bns")
5.2 Search results
Naive Bayes validation-set accuracy = 0.9280575539568345
Naive Bayes test-set accuracy = 0.7192982456140351
Naive Bayes classification time (ms, per parameter set): 10987
Best CountVectorizer parameters:
vocabSize = 262144, minDF = 2.0
Best naive Bayes parameters:
smoothing = 0.5
III. Performance Comparison of the Tuned Models
Cross-validation improves classification accuracy, as the table below shows. After tuning, XGBoost achieves the best results on both the validation and test sets, although its running time actually increased. Logistic regression, decision tree, random forest, and naive Bayes all gain some validation and test accuracy after tuning, and their running times drop.
| Model | | Validation accuracy | Test accuracy | Time (ms) |
| --- | --- | --- | --- | --- |
| Logistic regression | before tuning | 100% | 79.53% | 44697 |
| Logistic regression | after tuning | 100% | 81.87% | 17404 |
| Decision tree | before tuning | 81.58% | 73.68% | 34597 |
| Decision tree | after tuning | 82.30% | 78.94% | 17698 |
| Random forest | before tuning | 94.24% | 73.68% | 56608 |
| Random forest | after tuning | 95.10% | 81.40% | 15715 |
| XGBoost | before tuning | 99.71% | 79.53% | 31947 |
| XGBoost | after tuning | 100% | 86.54% | 32023 |
| Naive Bayes | before tuning | 83.74% | 71.34% | 11510 |
| Naive Bayes | after tuning | 92.80% | 71.92% | 10987 |