
Optimizing News Multi-class Classification Model Parameters with Cross-validation

In the previous post (https://blog.csdn.net/baymax_007/article/details/82775180), logistic regression, decision tree, random forest, multilayer perceptron, XGBoost, and naive Bayes classifiers were simply applied to classify news articles. In practice, however, a classifier's performance depends on the parameters it is initialized with, and choosing suitable parameters can improve classification accuracy.

Cross-validation is a common method for tuning model parameters. CrossValidator splits the dataset into several folds that are used separately for training and testing. For example, with k = 3, CrossValidator produces 3 train/test pairs, each using 2/3 of the data for training and the remaining 1/3 for testing. For a given parameter map, CrossValidator computes the average of the evaluation metric over the models trained on those 3 train/test pairs. Once the best parameter map has been determined, CrossValidator refits the estimator on the full dataset using that parameter map.
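The fold mechanics described above can be sketched in plain Scala, independent of Spark (the index-based splitting and metric function here are illustrative assumptions, not CrossValidator's actual implementation):

```scala
// Sketch of k-fold cross-validation mechanics (illustrative only).
object KFoldSketch {
  // Split indices 0 until n into k roughly equal train/test pairs.
  def folds(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] = {
    val idx = 0 until n
    (0 until k).map { i =>
      val test  = idx.filter(_ % k == i)  // 1/k of the data for testing
      val train = idx.filter(_ % k != i)  // remaining (k-1)/k for training
      (train, test)
    }
  }

  // Average a metric over the k train/test pairs.
  def crossValidate(n: Int, k: Int)(metric: (Seq[Int], Seq[Int]) => Double): Double = {
    val scores = folds(n, k).map { case (tr, te) => metric(tr, te) }
    scores.sum / scores.length
  }
}
```

With k = 3 and 9 samples, each pair trains on 6 indices and tests on 3, matching the 2/3 : 1/3 split described above.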

Spark ML provides ParamGridBuilder to construct the parameter search grid; a CrossValidator instance is then configured with the parameter grid and the Pipeline estimator.
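Note that the number of candidate models the search must fit is the product of the value counts passed to addGrid (times the number of folds). A plain-Scala sketch of that bookkeeping (the helper names are made up for illustration):

```scala
// Bookkeeping for a parameter grid (illustrative helpers, not Spark API).
object GridSize {
  // Number of parameter combinations: product of per-parameter value counts.
  def combinations(grids: Seq[Seq[Any]]): Int = grids.map(_.size).product

  // Enumerate the full cartesian product of parameter values.
  def cartesian[A](grids: Seq[Seq[A]]): Seq[Seq[A]] =
    grids.foldLeft(Seq(Seq.empty[A])) { (acc, g) =>
      for (combo <- acc; v <- g) yield combo :+ v
    }
}
```

The logistic-regression grid below, with four parameters of two values each, therefore yields 16 candidate parameter maps.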

I. Experiment Design

1. Feature-extraction parameters

CountVectorizer converts the segmented keywords into numeric term-frequency features. Its tunable parameters are vocabSize, the maximum vocabulary size, and minDF, the minimum number of distinct documents a term must appear in to be included in the vocabulary.
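The interaction of vocabSize and minDF can be illustrated with a plain-Scala term-counting sketch (a simplification of what CountVectorizer does; the helper and documents are hypothetical):

```scala
// Build a vocabulary from tokenized documents, keeping only terms that appear
// in at least minDF documents, truncated to the vocabSize most frequent terms.
object VocabSketch {
  def vocabulary(docs: Seq[Seq[String]], vocabSize: Int, minDF: Int): Seq[String] = {
    // Document frequency: number of documents each term occurs in.
    val docFreq  = docs.flatMap(_.distinct).groupBy(identity).map { case (t, xs) => t -> xs.size }
    // Corpus term frequency, used to rank terms for truncation.
    val termFreq = docs.flatten.groupBy(identity).map { case (t, xs) => t -> xs.size }
    docFreq.filter(_._2 >= minDF).keys.toSeq
      .sortBy(t => (-termFreq(t), t))  // most frequent first; ties alphabetical
      .take(vocabSize)
  }
}
```

Raising minDF drops rare terms; lowering vocabSize keeps only the most frequent ones.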

2. Classification-model parameters

2.1 Logistic regression

regParam: regularization coefficient; guards against overfitting

elasticNetParam: elastic-net mixing ratio between L1 and L2 regularization

2.2 Decision tree

impurity: impurity measure; options are entropy and gini

maxDepth: maximum depth of the tree

maxBins: maximum number of bins for discretizing continuous features

minInfoGain: minimum information gain required for a node to split

minInstancesPerNode: minimum number of samples each child node must contain
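For entropy impurity, the information gain that minInfoGain thresholds can be sketched as follows (a plain-Scala illustration, not Spark's implementation):

```scala
// Entropy of a label distribution, and the information gain of a binary split
// (a node splits only if this gain clears minInfoGain).
object InfoGainSketch {
  def entropy(labels: Seq[Int]): Double = {
    val n = labels.size.toDouble
    labels.groupBy(identity).values.map { g =>
      val p = g.size / n
      -p * math.log(p) / math.log(2)  // log base 2
    }.sum
  }

  def infoGain(parent: Seq[Int], left: Seq[Int], right: Seq[Int]): Double = {
    val n = parent.size.toDouble
    entropy(parent) - (left.size / n) * entropy(left) - (right.size / n) * entropy(right)
  }
}
```

A split that perfectly separates two classes yields a gain of 1 bit, while a useless split yields a gain near 0 and is rejected by minInfoGain.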

2.3 Random forest

impurity: impurity measure; options are entropy and gini

maxDepth: maximum depth of each tree

maxBins: maximum number of bins for discretizing continuous features

minInfoGain: minimum information gain required for a node to split

minInstancesPerNode: minimum number of samples each child node must contain

numTrees: number of trees trained on bootstrap samples

subsamplingRate: fraction of the data sampled for training each tree

2.4 XGBoost

eta: learning rate, default 0.3

maxDepth: maximum tree depth

numRound: number of boosting rounds

alpha: L1 regularization coefficient

2.5 Naive Bayes

smoothing: smoothing parameter
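The smoothing parameter adds a pseudo-count to every vocabulary term when estimating per-class term probabilities (Laplace smoothing when it equals 1.0), so unseen terms never receive zero probability. A minimal sketch, with a hypothetical helper:

```scala
// Laplace/Lidstone smoothing for a term's conditional probability in a class:
//   P(term | class) = (count + smoothing) / (total + smoothing * vocabSize)
object SmoothingSketch {
  def smoothedProb(count: Long, total: Long, vocabSize: Int, smoothing: Double): Double =
    (count + smoothing) / (total + smoothing * vocabSize)
}
```

With smoothing = 1.0, a term never seen in a class still gets probability 1 / (total + vocabSize) instead of 0.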

3. Cross-validation

The ml package provides an API for parameter search by cross-validation: ParamGridBuilder.addGrid sets the parameter grid for the pipeline model, and CrossValidator searches the grid during training for the optimal combination.

4. Detailed design

II. Implementation

1. Logistic regression parameter tuning

1.1 Code

val lrStartTime = new Date().getTime
val lrPipeline = new Pipeline()
  .setStages(Array(indexer, segmenter, remover, vectorizer, lr, converts))
// Cross-validation parameter grid and model
val lrParamGrid = new ParamGridBuilder()
  .addGrid(vectorizer.vocabSize, Array(1 << 10, 1 << 18))
  .addGrid(vectorizer.minDF, Array(1.0, 2.0))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .addGrid(lr.elasticNetParam, Array(0.1, 0.0))
  .build()

val lrCv = new CrossValidator()
  .setEstimator(lrPipeline)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setEstimatorParamMaps(lrParamGrid)
  .setNumFolds(2)
  .setParallelism(4)

val lrModel = lrCv.fit(train)
val lrValiad = lrModel.transform(train)
val lrPredictions = lrModel.transform(test)

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

val accuracyLrt = evaluator.evaluate(lrValiad)
println(s"Logistic regression validation-set accuracy = $accuracyLrt")
val accuracyLrv = evaluator.evaluate(lrPredictions)
println(s"Logistic regression test-set accuracy = $accuracyLrv")
val lrEndTime = new Date().getTime
// Average wall-clock time per parameter combination
val lrCostTime = (lrEndTime - lrStartTime) / lrParamGrid.length
println(s"Logistic regression time cost: $lrCostTime")
// Retrieve the best model
val bestLrModel = lrModel.bestModel.asInstanceOf[PipelineModel]
val bestLrVectorizer = bestLrModel.stages(3).asInstanceOf[CountVectorizerModel]
val blvv = bestLrVectorizer.getVocabSize
val blvm = bestLrVectorizer.getMinDF
val bestLr = bestLrModel.stages(4).asInstanceOf[LogisticRegressionModel]
val blr = bestLr.getRegParam
val ble = bestLr.getElasticNetParam
println(s"Best CountVectorizer parameters:\nvocabSize = $blvv, minDF = $blvm\nBest logistic regression parameters:\nregParam = $blr, elasticNetParam = $ble")

1.2 Tuning results

Logistic regression validation-set accuracy = 1.0
Logistic regression test-set accuracy = 0.8187134502923976
Logistic regression time cost: 17404
Best CountVectorizer parameters: vocabSize = 262144, minDF = 2.0
Best logistic regression parameters: regParam = 0.01, elasticNetParam = 0.1

2. Decision tree parameter tuning

2.1 Code

val dtStartTime = new Date().getTime
val dt = new DecisionTreeClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setSeed(123456L)

val dtPipeline = new Pipeline()
  .setStages(Array(indexer, segmenter, remover, vectorizer, dt, converts))

// Cross-validation parameter grid and model
val dtParamGrid = new ParamGridBuilder()
  .addGrid(vectorizer.vocabSize, Array(1 << 10, 1 << 18))
  .addGrid(vectorizer.minDF, Array(1.0, 2.0))
  .addGrid(dt.impurity, Array("entropy", "gini"))
  .addGrid(dt.maxDepth, Array(5, 10))
  .addGrid(dt.maxBins, Array(32, 500))
  .addGrid(dt.minInfoGain, Array(0.1, 0.01))
  .addGrid(dt.minInstancesPerNode, Array(5, 10))
  .build()

val dtCv = new CrossValidator()
  .setEstimator(dtPipeline)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setEstimatorParamMaps(dtParamGrid)
  .setNumFolds(2)
  .setParallelism(7)

val dtModel = dtCv.fit(train)
val dtValiad = dtModel.transform(train)
val dtPredictions = dtModel.transform(test)
val accuracyDtt = evaluator.evaluate(dtValiad)
println(s"Decision tree validation-set accuracy = $accuracyDtt")
val accuracyDtv = evaluator.evaluate(dtPredictions)
println(s"Decision tree test-set accuracy = $accuracyDtv")
val dtEndTime = new Date().getTime
// Average wall-clock time per parameter combination
val dtCostTime = (dtEndTime - dtStartTime) / dtParamGrid.length
println(s"Decision tree time cost: $dtCostTime")

// Retrieve the best model
val bestDtModel = dtModel.bestModel.asInstanceOf[PipelineModel]
val bestDtVectorizer = bestDtModel.stages(3).asInstanceOf[CountVectorizerModel]
val bdvv = bestDtVectorizer.getVocabSize
val bdvm = bestDtVectorizer.getMinDF
val bestDt = bestDtModel.stages(4).asInstanceOf[DecisionTreeClassificationModel]
val bdi = bestDt.getImpurity
val bdmd = bestDt.getMaxDepth
val bdmb = bestDt.getMaxBins
val bdmig = bestDt.getMinInfoGain
val bdmipn = bestDt.getMinInstancesPerNode
println(s"Best CountVectorizer parameters:\nvocabSize = $bdvv, minDF = $bdvm\nBest decision tree parameters:\nimpurity = $bdi, maxDepth = $bdmd, maxBins = $bdmb, minInfoGain = $bdmig, minInstancesPerNode = $bdmipn")

2.2 Tuning results

Decision tree validation-set accuracy = 0.823021582733813
Decision tree test-set accuracy = 0.7894736842105263
Decision tree time cost: 17698
Best CountVectorizer parameters: vocabSize = 262144, minDF = 1.0
Best decision tree parameters: impurity = entropy, maxDepth = 10, maxBins = 32, minInfoGain = 0.1, minInstancesPerNode = 5

3. Random forest parameter tuning

3.1 Code

val rfStartTime = new Date().getTime
val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setSeed(123456L)

val rfPipeline = new Pipeline()
  .setStages(Array(indexer, segmenter, remover, vectorizer, rf, converts))

// Cross-validation parameter grid and model
val rfParamGrid = new ParamGridBuilder()
  .addGrid(vectorizer.vocabSize, Array(1 << 10, 1 << 18))
  .addGrid(vectorizer.minDF, Array(1.0, 2.0))
  .addGrid(rf.impurity, Array("entropy", "gini"))
  .addGrid(rf.maxDepth, Array(5, 10))
  .addGrid(rf.maxBins, Array(32, 500))
  .addGrid(rf.minInfoGain, Array(0.1, 0.01))
  .addGrid(rf.minInstancesPerNode, Array(5, 10))
  .addGrid(rf.numTrees, Array(20, 50))
  .addGrid(rf.subsamplingRate, Array(0.2, 0.1))
  .build()

val rfCv = new CrossValidator()
  .setEstimator(rfPipeline)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setEstimatorParamMaps(rfParamGrid)
  .setNumFolds(2)
  .setParallelism(9)

val rfModel = rfCv.fit(train)
val rfValiad = rfModel.transform(train)
val rfPredictions = rfModel.transform(test)
val accuracyRft = evaluator.evaluate(rfValiad)
println(s"Random forest validation-set accuracy = $accuracyRft")
val accuracyRfv = evaluator.evaluate(rfPredictions)
println(s"Random forest test-set accuracy = $accuracyRfv")
val rfEndTime = new Date().getTime
// Average wall-clock time per parameter combination
val rfCostTime = (rfEndTime - rfStartTime) / rfParamGrid.length
println(s"Random forest time cost: $rfCostTime")

// Retrieve the best model
val bestRfModel = rfModel.bestModel.asInstanceOf[PipelineModel]
val bestRfVectorizer = bestRfModel.stages(3).asInstanceOf[CountVectorizerModel]
val brvv = bestRfVectorizer.getVocabSize
val brvm = bestRfVectorizer.getMinDF
val bestRf = bestRfModel.stages(4).asInstanceOf[RandomForestClassificationModel]
val bri = bestRf.getImpurity
val brmd = bestRf.getMaxDepth
val brmb = bestRf.getMaxBins
val brmig = bestRf.getMinInfoGain
val brmipn = bestRf.getMinInstancesPerNode
val brnt = bestRf.getNumTrees
val brsr = bestRf.getSubsamplingRate
println(s"Best CountVectorizer parameters:\nvocabSize = $brvv, minDF = $brvm\nBest random forest parameters:\nimpurity = $bri, maxDepth = $brmd, maxBins = $brmb, minInfoGain = $brmig, minInstancesPerNode = $brmipn, numTrees = $brnt, subsamplingRate = $brsr")

3.2 Tuning results

Random forest validation-set accuracy = 0.9510791366906475
Random forest test-set accuracy = 0.8140350877192983
Random forest time cost: 15715
Best CountVectorizer parameters: vocabSize = 1024, minDF = 2.0
Best random forest parameters: impurity = gini, maxDepth = 10, maxBins = 32, minInfoGain = 0.01, minInstancesPerNode = 5, numTrees = 50, subsamplingRate = 0.2

4. XGBoost parameter tuning

4.1 Code

val xgbStartTime = new Date().getTime
val xgb = new XGBoostClassifier()
  .setObjective("multi:softprob")
  .setNumClass(outputLayers)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setNumWorkers(1)

val xgbPipeline = new Pipeline()
  .setStages(Array(indexer, segmenter, remover, vectorizer, xgb, converts))

// Cross-validation parameter grid and model
val xgbParamGrid = new ParamGridBuilder()
  .addGrid(vectorizer.vocabSize, Array(1 << 10, 1 << 18))
  .addGrid(vectorizer.minDF, Array(1.0, 2.0))
  .addGrid(xgb.eta, Array(0.3, 0.1))
  .addGrid(xgb.maxDepth, Array(6, 10))
  .addGrid(xgb.numRound, Array(10, 100))
  .addGrid(xgb.alpha, Array(0.1, 0.0))
  .build()

val xgbCv = new CrossValidator()
  .setEstimator(xgbPipeline)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setEstimatorParamMaps(xgbParamGrid)
  .setNumFolds(2)
  .setParallelism(6)

val xgbModel = xgbCv.fit(train)
val xgbValiad = xgbModel.transform(train)
val xgbPredictions = xgbModel.transform(test)
val accuracyXgbt = evaluator.evaluate(xgbValiad)
println(s"XGBoost validation-set accuracy = $accuracyXgbt")
val accuracyXgbv = evaluator.evaluate(xgbPredictions)
println(s"XGBoost test-set accuracy = $accuracyXgbv")
val xgbEndTime = new Date().getTime
// Average wall-clock time per parameter combination
val xgbCostTime = (xgbEndTime - xgbStartTime) / xgbParamGrid.length
println(s"XGBoost time cost: $xgbCostTime")

// Retrieve the best model
val bestXgbModel = xgbModel.bestModel.asInstanceOf[PipelineModel]
val bestXgbVectorizer = bestXgbModel.stages(3).asInstanceOf[CountVectorizerModel]
val bxvv = bestXgbVectorizer.getVocabSize
val bxvm = bestXgbVectorizer.getMinDF
val bestXgb = bestXgbModel.stages(4).asInstanceOf[XGBoostClassificationModel]
val bxe = bestXgb.getEta
val bxmd = bestXgb.getMaxDepth
val bxnr = bestXgb.getNumRound
val bxa = bestXgb.getAlpha
println(s"Best CountVectorizer parameters:\nvocabSize = $bxvv, minDF = $bxvm\nBest XGBoost parameters:\neta = $bxe, maxDepth = $bxmd, numRound = $bxnr, alpha = $bxa")

4.2 Tuning results

XGBoost validation-set accuracy = 1.0
XGBoost test-set accuracy = 0.8654970760233918
XGBoost time cost: 32023
Best CountVectorizer parameters: vocabSize = 262144, minDF = 2.0
Best XGBoost parameters: eta = 0.1, maxDepth = 6, numRound = 100, alpha = 0.0

5. Naive Bayes parameter tuning

5.1 Code

val nvbStartTime = new Date().getTime
val nvb = new NaiveBayes()
val nvbPipeline = new Pipeline()
  .setStages(Array(indexer, segmenter, remover, vectorizer, nvb, converts))

// Cross-validation parameter grid and model
val nvbParamGrid = new ParamGridBuilder()
  .addGrid(vectorizer.vocabSize, Array(1 << 10, 1 << 18))
  .addGrid(vectorizer.minDF, Array(1.0, 2.0))
  .addGrid(nvb.smoothing, Array(1.0, 0.5))
  .build()

val nvbCv = new CrossValidator()
  .setEstimator(nvbPipeline)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setEstimatorParamMaps(nvbParamGrid)
  .setNumFolds(2)
  .setParallelism(3)

val nvbModel = nvbCv.fit(train)
val nvbValiad = nvbModel.transform(train)
val nvbPredictions = nvbModel.transform(test)
val accuracyNvbt = evaluator.evaluate(nvbValiad)
println(s"Naive Bayes validation-set accuracy = $accuracyNvbt")
val accuracyNvbv = evaluator.evaluate(nvbPredictions)
println(s"Naive Bayes test-set accuracy = $accuracyNvbv")
val nvbEndTime = new Date().getTime
// Average wall-clock time per parameter combination
val nvbCostTime = (nvbEndTime - nvbStartTime) / nvbParamGrid.length
println(s"Naive Bayes time cost: $nvbCostTime")

// Retrieve the best model
val bestNvbModel = nvbModel.bestModel.asInstanceOf[PipelineModel]
val bestNvbVectorizer = bestNvbModel.stages(3).asInstanceOf[CountVectorizerModel]
val bnvv = bestNvbVectorizer.getVocabSize
val bnvm = bestNvbVectorizer.getMinDF
val bestNvb = bestNvbModel.stages(4).asInstanceOf[NaiveBayesModel]
val bns = bestNvb.getSmoothing
println(s"Best CountVectorizer parameters:\nvocabSize = $bnvv, minDF = $bnvm\nBest naive Bayes parameters:\nsmoothing = $bns")

5.2 Tuning results

Naive Bayes validation-set accuracy = 0.9280575539568345
Naive Bayes test-set accuracy = 0.7192982456140351
Naive Bayes time cost: 10987
Best CountVectorizer parameters: vocabSize = 262144, minDF = 2.0
Best naive Bayes parameters: smoothing = 0.5

III. Performance Comparison of the Tuned Models

Cross-validation improves classification accuracy, as the comparison below shows. After tuning, XGBoost achieves the best accuracy on both the validation and test sets, though its time cost increases. Logistic regression, decision tree, random forest, and naive Bayes all improve on both validation-set and test-set accuracy after tuning, while their time cost decreases.

Model                 Stage           Validation accuracy   Test accuracy   Time (ms)
Logistic regression   before tuning   100%                  79.53%          44697
                      after tuning    100%                  81.87%          17404
Decision tree         before tuning   81.58%                73.68%          34597
                      after tuning    82.30%                78.94%          17698
Random forest         before tuning   94.24%                73.68%          56608
                      after tuning    95.10%                81.40%          15715
XGBoost               before tuning   99.71%                79.53%          31947
                      after tuning    100%                  86.54%          32023
Naive Bayes           before tuning   83.74%                71.34%          11510
                      after tuning    92.80%                71.92%          10987
