
Multi-class news classification performance comparison: logistic regression / decision tree / random forest / multilayer perceptron classifier / XGBoost / naive Bayes

In the previous post (https://blog.csdn.net/baymax_007/article/details/82748544), logistic regression was used for multi-class news classification. Building on that work, this post adds decision tree, random forest, multilayer perceptron classifier, XGBoost, and naive Bayes classifiers, and compares their classification accuracy on the validation and test sets along with their running time.

Spark ML ships with decision trees, random forests, gradient-boosted decision trees (GBDT), linear support vector machines (LinearSVC), a multilayer perceptron classifier (MPC, a simple feed-forward neural network), and naive Bayes, all usable out of the box. Note that in Spark MLlib 2.3.1, GBDT and LinearSVC do not yet support multi-class classification, so this post leaves them out of the comparison. xgboost4j-spark wraps XGBoost for Java and Scala and can likewise be used directly.

一、Environment

java 1.8.0_172 + scala 2.11.8 + spark 2.3.1 + HanLP portable-1.6.8 + xgboost4j-spark 0.80

	<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.3.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.3.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>2.3.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>2.3.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.hankcs/hanlp -->
        <dependency>
            <groupId>com.hankcs</groupId>
            <artifactId>hanlp</artifactId>
            <version>portable-1.6.8</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/ml.dmlc/xgboost4j-spark -->
        <dependency>
            <groupId>ml.dmlc</groupId>
            <artifactId>xgboost4j-spark</artifactId>
            <version>0.80</version>
        </dependency>

二、Experiment Design

Spark ML supports pipelines: the feature extraction/transformation stages and the classifier can be assembled into a single Pipeline, trained as one unit, and evaluated under a common metric, which lets the algorithms be compared without duplicated code. Each algorithm's pipeline shares the same preprocessing stages (label indexer → segmenter → stop-word remover → vectorizer → label converter) and differs only in the classifier stage.

The HanLP segmentation used in the previous post could not be placed into a pipeline, so we define a custom ML tokenizer that extends the UnaryTransformer class and overrides createTransformFunc, wrapping HanLP segmentation in a transformer that can be assembled into a pipeline.

三、Code Implementation

1. Custom HanLP tokenizer wrapped for pipeline assembly

Spark ML's built-in Tokenizer can be placed in a pipeline; reading its source shows that it extends UnaryTransformer and overrides the createTransformFunc method, which is where the actual tokenization happens, while outputDataType declares the output type and validateInputType checks the input type.

import com.hankcs.hanlp.model.crf.CRFLexicalAnalyzer
import com.hankcs.hanlp.seg.NShort.NShortSegment
import com.hankcs.hanlp.seg.common.Term
import com.hankcs.hanlp.tokenizer.{IndexTokenizer, NLPTokenizer, SpeedTokenizer, StandardTokenizer}
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{ArrayType, DataType, DataTypes, StringType}

class HanLPTokenizer(override val uid:String) extends UnaryTransformer[String, Seq[String], HanLPTokenizer] {

    // default must match one of the cases handled in hanLP below
    private var segmentType = "StandardSegment"
    private var enableNature = false

    def setSegmentType(value:String):this.type = {
      this.segmentType = value
      this
    }

    def enableNature(value:Boolean):this.type  = {
      this.enableNature = value
      this
    }

    def this() = this(Identifiable.randomUID("HanLPTokenizer"))

    override protected def createTransformFunc: String => Seq[String] = {
      hanLP
    }

    private def hanLP(line:String): Seq[String] = {
      var terms: Seq[Term] = Seq()
      import collection.JavaConversions._
      segmentType match {
        case "StandardSegment" =>
          terms = StandardTokenizer.segment(line)
        case "NLPSegment" =>
          terms = NLPTokenizer.segment(line)
        case "IndexSegment" =>
          terms = IndexTokenizer.segment(line)
        case "SpeedSegment" =>
          terms = SpeedTokenizer.segment(line)
        case "NShortSegment" =>
          terms = new NShortSegment().seg(line)
        case "CRFlexicalAnalyzer" =>
          terms = new CRFLexicalAnalyzer().seg(line)
        case _ =>
          throw new IllegalArgumentException(s"Unsupported segment type: $segmentType")
      }
      terms.map(term => if (this.enableNature) term.toString else term.word)
    }

    override protected def validateInputType(inputType: DataType): Unit = {
      require(inputType == DataTypes.StringType,
        s"Input type must be string type but got $inputType.")
    }
    override protected def outputDataType: DataType = new ArrayType(StringType, true)
  }

2. Feature engineering code

This covers: label-to-index conversion, text segmentation, stop-word removal, term-frequency feature extraction, and mapping predicted indices back to label strings.

val indexer = new StringIndexer()
      .setInputCol("tab")
      .setOutputCol("label")
      .fit(peopleNews)

    val segmenter = new HanLPTokenizer()
      .setInputCol("content")
      .setOutputCol("tokens")
      .enableNature(false)
      .setSegmentType("StandardSegment")

    val stopwords = spark.read.textFile("/opt/data/stopwordsCH.txt").collect()

    val remover = new StopWordsRemover()
      .setStopWords(stopwords)
      .setInputCol("tokens")
      .setOutputCol("removed")

    val vectorizer = new CountVectorizer()
      .setVocabSize(1000)
      .setInputCol("removed")
      .setOutputCol("features")


    val converts = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictionTab")
      .setLabels(indexer.labels)

3. Logistic regression code

    val lr = new LogisticRegression()
      .setMaxIter(40)
      .setTol(1e-7)
      .setLabelCol("label")
      .setFeaturesCol("features")

    val lrStartTime = new Date().getTime
    val lrPipeline = new Pipeline()
        .setStages(Array(indexer,segmenter,remover,vectorizer,lr,converts))

    val Array(train,test) = peopleNews.randomSplit(Array(0.8,0.2),12L)

    val lrModel = lrPipeline.fit(train)
    val lrValiad = lrModel.transform(train) // the "validation" accuracy below is measured on the training split
    val lrPredictions = lrModel.transform(test)

    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")

    val accuracyLrt = evaluator.evaluate(lrValiad)
    println(s"LR validation-set accuracy = $accuracyLrt")
    val accuracyLrv = evaluator.evaluate(lrPredictions)
    println(s"LR test-set accuracy = $accuracyLrv")
    val lrEndTime = new Date().getTime
    val lrCostTime = lrEndTime - lrStartTime
    println(s"LR time cost: $lrCostTime ms")

4. Decision tree code

//    Train the decision tree model
    val dtStartTime = new Date().getTime
    val dt = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setImpurity("entropy") // impurity measure
      .setMaxBins(1000) // max number of bins when discretizing continuous features
      .setMaxDepth(10) // maximum tree depth
      .setMinInfoGain(0.01) // minimum information gain for a split, in [0,1]
      .setMinInstancesPerNode(5) // minimum number of samples per node
      .setSeed(123456L)

    val dtPipeline = new Pipeline()
      .setStages(Array(indexer,segmenter,remover,vectorizer,dt,converts))

    val dtModel = dtPipeline.fit(train)
    val dtValiad = dtModel.transform(train)
    val dtPredictions = dtModel.transform(test)
    val accuracyDtt = evaluator.evaluate(dtValiad)
    println(s"Decision tree validation-set accuracy = $accuracyDtt")
    val accuracyDtv = evaluator.evaluate(dtPredictions)
    println(s"Decision tree test-set accuracy = $accuracyDtv")
    val dtEndTime = new Date().getTime
    val dtCostTime = dtEndTime - dtStartTime
    println(s"Decision tree time cost: $dtCostTime ms")

5. Random forest code

//    Train the random forest model
    val rfStartTime = new Date().getTime
    val rf = new RandomForestClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setImpurity("entropy") // impurity measure
      .setMaxBins(1000) // max number of bins when discretizing continuous features
      .setMaxDepth(10) // maximum tree depth
      .setMinInfoGain(0.01) // minimum information gain for a split, in [0,1]
      .setMinInstancesPerNode(5) // minimum number of samples per node
      .setNumTrees(100)
      .setSeed(123456L)

    val rfPipeline = new Pipeline()
      .setStages(Array(indexer,segmenter,remover,vectorizer,rf,converts))
    val rfModel = rfPipeline.fit(train)
    val rfValiad = rfModel.transform(train)
    val rfPredictions = rfModel.transform(test)
    val accuracyRft = evaluator.evaluate(rfValiad)
    println(s"Random forest validation-set accuracy = $accuracyRft")
    val accuracyRfv = evaluator.evaluate(rfPredictions)
    println(s"Random forest test-set accuracy = $accuracyRfv")
    val rfEndTime = new Date().getTime
    val rfCostTime = rfEndTime - rfStartTime
    println(s"Random forest time cost: $rfCostTime ms")

6. Multilayer perceptron classifier code

A rule of thumb for sizing the layers of the multilayer perceptron classifier (a simple feed-forward network):

m: number of input nodes; n: number of output nodes; h1: first hidden layer size = log2(m); h2: second hidden layer size = sqrt(m+n) + a, where a is taken between 1 and 10.
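As a quick standalone check of this rule, a minimal sketch in plain Scala, assuming for illustration m = 1000 (the CountVectorizer vocabulary size used below) and n = 12 classes; in the real code n is derived from the data:

```scala
// Worked example of the MPC layer-sizing heuristic (m and n are illustrative).
val m = 1000 // input nodes: CountVectorizer vocabulary size
val n = 12   // output nodes: number of classes (assumed here; computed from data in practice)

val h1 = Math.round(Math.log(m) / Math.log(2)).toInt // first hidden layer: log2(m)
val h2 = Math.round(Math.sqrt(m + n) + 1).toInt      // second hidden layer: sqrt(m+n) + a, with a = 1

val layers = Array(m, h1, h2, n)
println(layers.mkString("layers = [", ", ", "]")) // layers = [1000, 10, 33, 12]
```

These are exactly the values fed to setLayers in the code below, with `a = 1` as the chosen constant.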

//    Multilayer perceptron classifier
    val inputLayers = vectorizer.getVocabSize
    val hideLayer1 = Math.round(Math.log(inputLayers)/Math.log(2)).toInt
    val outputLayers = peopleNews.select("tab").distinct().count().toInt
    val hideLayer2 = Math.round(Math.sqrt(inputLayers + outputLayers) + 1).toInt
    val layers = Array[Int](inputLayers, hideLayer1, hideLayer2, outputLayers)
    val mpcstartTime = new Date().getTime
    val mpc = new MultilayerPerceptronClassifier()
      .setLayers(layers)
      .setBlockSize(128)
      .setTol(1e-7)
      .setMaxIter(100)
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setSeed(1234L)

    val mpcPipeline = new Pipeline()
      .setStages(Array(indexer,segmenter,remover,vectorizer,mpc,converts))
    val mpcModel = mpcPipeline.fit(train)
    val mpcValiad = mpcModel.transform(train)
    val mpcPredictions = mpcModel.transform(test)
    val accuracyMpct = evaluator.evaluate(mpcValiad)
    println(s"MPC validation-set accuracy = $accuracyMpct")
    val accuracyMpcv = evaluator.evaluate(mpcPredictions)
    println(s"MPC test-set accuracy = $accuracyMpcv")
    val mpcEndTime = new Date().getTime
    val mpcCostTime = mpcEndTime - mpcstartTime
    println(s"MPC time cost: $mpcCostTime ms")

7. XGBoost code

//    Train the XGBoost model
    val xgbParam = Map("eta" -> 0.1f,
      "max_depth" -> 10, // maximum tree depth; default 6, range [1,∞)
      "objective" -> "multi:softprob", // learning task and objective
      "num_class" -> outputLayers,
      "num_round" -> 10,
      "num_workers" -> 1)
    val xgbStartTime = new Date().getTime
    val xgb = new XGBoostClassifier(xgbParam).
      setFeaturesCol("features").
      setLabelCol("label")

    val xgbPipeline = new Pipeline()
      .setStages(Array(indexer,segmenter,remover,vectorizer,xgb,converts))
    val xgbModel = xgbPipeline.fit(train)
    val xgbValiad = xgbModel.transform(train)
    val xgbPredictions = xgbModel.transform(test)
    val accuracyXgbt = evaluator.evaluate(xgbValiad)
    println(s"XGBoost validation-set accuracy = $accuracyXgbt")
    val accuracyXgbv = evaluator.evaluate(xgbPredictions)
    println(s"XGBoost test-set accuracy = $accuracyXgbv")
    val xgbEndTime = new Date().getTime
    val xgbCostTime = xgbEndTime - xgbStartTime
    println(s"XGBoost time cost: $xgbCostTime ms")

8. Naive Bayes code

//    Naive Bayes classifier
    val nvbStartTime = new Date().getTime
    val nvb = new NaiveBayes()
    val nvbPipeline = new Pipeline()
      .setStages(Array(indexer,segmenter,remover,vectorizer,nvb,converts))
    val nvbModel = nvbPipeline.fit(train)
    val nvbValiad = nvbModel.transform(train)
    val nvbPredictions = nvbModel.transform(test)
    val accuracyNvbt = evaluator.evaluate(nvbValiad)
    println(s"Naive Bayes validation-set accuracy = $accuracyNvbt")
    val accuracyNvbv = evaluator.evaluate(nvbPredictions)
    println(s"Naive Bayes test-set accuracy = $accuracyNvbv")
    val nvbEndTime = new Date().getTime
    val nvbCostTime = nvbEndTime - nvbStartTime
    println(s"Naive Bayes time cost: $nvbCostTime ms")

四、Performance Comparison

Algorithm                        Validation accuracy   Test accuracy   Time (ms)
Logistic regression (LR)         100%                  79.53%          44697
Decision tree (DT)               81.58%                73.68%          34597
Random forest (RF)               94.24%                73.68%          56608
Multilayer perceptron (MPC)      97.98%                68.42%          30801
XGBoost                          99.71%                79.53%          31947
Naive Bayes (NVB)                83.74%                71.34%          11510

Hyperparameter tuning for the algorithms above will be attempted in a follow-up post.
