Performance Comparison of Multi-class Text Classification with Logistic Regression, Decision Tree, Random Forest, Multilayer Perceptron Classifier, XGBoost, and Naive Bayes
In the previous post (https://blog.csdn.net/baymax_007/article/details/82748544), multi-class text classification was implemented with logistic regression. This post builds on that work by adding decision tree, random forest, multilayer perceptron classifier, XGBoost, and naive Bayes classifiers, and compares their classification accuracy on the validation and test sets as well as their elapsed time.
Spark ML ships with decision tree, random forest, gradient-boosted trees (GBDT), linear support vector machine (LinearSVC), multilayer perceptron classifier (MPC, a simple neural network), and naive Bayes classifiers, all usable out of the box. Note that gradient-boosted trees and LinearSVC do not yet support multi-class classification in Spark MLlib 2.3.1, so they are left out of this comparison. xgboost4j-spark wraps XGBoost for Java and Scala and can likewise be used directly.
I. Environment
java 1.8.0_172 + scala 2.11.8 + spark 2.3.1 + HanLP portable-1.6.8 + xgboost4j-spark 0.80
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>2.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.11</artifactId>
  <version>2.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-hive -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hive_2.11</artifactId>
  <version>2.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-mllib_2.11</artifactId>
  <version>2.3.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/com.hankcs/hanlp -->
<dependency>
  <groupId>com.hankcs</groupId>
  <artifactId>hanlp</artifactId>
  <version>portable-1.6.8</version>
</dependency>
<!-- https://mvnrepository.com/artifact/ml.dmlc/xgboost4j-spark -->
<dependency>
  <groupId>ml.dmlc</groupId>
  <artifactId>xgboost4j-spark</artifactId>
  <version>0.80</version>
</dependency>
II. Experimental Design
Spark ML supports pipelines: the feature extraction/transformation stages and the classification model can be assembled into a single Pipeline, the pipeline is trained as a whole, and the algorithms are then compared under a common evaluation metric. This cuts down on duplicated code, as shown in the figure below.
The HanLP tokenization from the previous post, however, cannot be dropped into a pipeline directly. We therefore define a custom ml Tokenizer that extends UnaryTransformer and overrides createTransformFunc, turning HanLP segmentation into a pipeline-compatible stage.
III. Code Implementation
1. Wrapping HanLP tokenization into a pipeline stage
The built-in ml Tokenizer can be placed in a pipeline; its source shows that it extends UnaryTransformer and overrides createTransformFunc, the core tokenization method, while outputDataType and validateInputType constrain the output type and validate the input type.
class HanLPTokenizer(override val uid:String) extends UnaryTransformer[String, Seq[String], HanLPTokenizer] {
private var segmentType = "StandardSegment" // must match a case below; "StandardTokenizer" would never match
private var enableNature = false
def setSegmentType(value:String):this.type = {
this.segmentType = value
this
}
def enableNature(value:Boolean):this.type = {
this.enableNature = value
this
}
def this() = this(Identifiable.randomUID("HanLPTokenizer"))
override protected def createTransformFunc: String => Seq[String] = {
hanLP
}
private def hanLP(line:String): Seq[String] = {
import collection.JavaConversions._
val terms: Seq[Term] = segmentType match {
case "StandardSegment" =>
StandardTokenizer.segment(line)
case "NLPSegment" =>
NLPTokenizer.segment(line)
case "IndexSegment" =>
IndexTokenizer.segment(line)
case "SpeedSegment" =>
SpeedTokenizer.segment(line)
case "NShortSegment" =>
new NShortSegment().seg(line)
case "CRFLexicalAnalyzer" =>
new CRFLexicalAnalyzer().seg(line)
case _ =>
// Throw instead of System.exit(1): exiting the JVM inside a transformer would kill the executor.
throw new IllegalArgumentException(s"Unknown segment type: $segmentType")
}
terms.map(term =>
if(this.enableNature) term.toString else term.word)
}
override protected def validateInputType(inputType: DataType): Unit = {
require(inputType == DataTypes.StringType,
s"Input type must be string type but got $inputType.")
}
override protected def outputDataType: DataType = new ArrayType(StringType, true)
}
2. Feature engineering code
This mainly covers: label-to-index conversion, text tokenization, stop-word removal, term-frequency feature extraction, and mapping predicted indices back to the original labels.
val indexer = new StringIndexer()
.setInputCol("tab")
.setOutputCol("label")
.fit(peopleNews)
val segmenter = new HanLPTokenizer()
.setInputCol("content")
.setOutputCol("tokens")
.enableNature(false)
.setSegmentType("StandardSegment")
val stopwords = spark.read.textFile("/opt/data/stopwordsCH.txt").collect()
val remover = new StopWordsRemover()
.setStopWords(stopwords)
.setInputCol("tokens")
.setOutputCol("removed")
val vectorizer = new CountVectorizer()
.setVocabSize(1000)
.setInputCol("removed")
.setOutputCol("features")
val converts = new IndexToString()
.setInputCol("prediction")
.setOutputCol("predictionTab")
.setLabels(indexer.labels)
3. Logistic regression code
val lr = new LogisticRegression()
.setMaxIter(40)
.setTol(1e-7)
.setLabelCol("label")
.setFeaturesCol("features")
val lrStartTime = new Date().getTime
val lrPipeline = new Pipeline()
.setStages(Array(indexer,segmenter,remover,vectorizer,lr,converts))
val Array(train,test) = peopleNews.randomSplit(Array(0.8,0.2),12L)
val lrModel = lrPipeline.fit(train)
val lrValiad = lrModel.transform(train)
val lrPredictions = lrModel.transform(test)
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracyLrt = evaluator.evaluate(lrValiad)
println(s"Logistic regression validation-set accuracy = $accuracyLrt")
val accuracyLrv = evaluator.evaluate(lrPredictions)
println(s"Logistic regression test-set accuracy = $accuracyLrv")
val lrEndTime = new Date().getTime
val lrCostTime = lrEndTime - lrStartTime
println(s"Logistic regression elapsed time: $lrCostTime ms")
4. Decision tree code
// train the decision tree model
val dtStartTime = new Date().getTime
val dt = new DecisionTreeClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
.setImpurity("entropy") // impurity measure
.setMaxBins(1000) // max number of bins for discretizing continuous features
.setMaxDepth(10) // maximum tree depth
.setMinInfoGain(0.01) // minimum information gain for a split, in [0,1]
.setMinInstancesPerNode(5) // minimum number of instances per node
.setSeed(123456L)
val dtPipeline = new Pipeline()
.setStages(Array(indexer,segmenter,remover,vectorizer,dt,converts))
val dtModel = dtPipeline.fit(train)
val dtValiad = dtModel.transform(train)
val dtPredictions = dtModel.transform(test)
val accuracyDtt = evaluator.evaluate(dtValiad)
println(s"Decision tree validation-set accuracy = $accuracyDtt")
val accuracyDtv = evaluator.evaluate(dtPredictions)
println(s"Decision tree test-set accuracy = $accuracyDtv")
val dtEndTime = new Date().getTime
val dtCostTime = dtEndTime - dtStartTime
println(s"Decision tree elapsed time: $dtCostTime ms")
5. Random forest code
// train the random forest model
val rfStartTime = new Date().getTime
val rf = new RandomForestClassifier()
.setLabelCol("label")
.setFeaturesCol("features")
.setImpurity("entropy") // impurity measure
.setMaxBins(1000) // max number of bins for discretizing continuous features
.setMaxDepth(10) // maximum tree depth
.setMinInfoGain(0.01) // minimum information gain for a split, in [0,1]
.setMinInstancesPerNode(5) // minimum number of instances per node
.setNumTrees(100)
.setSeed(123456L)
val rfPipeline = new Pipeline()
.setStages(Array(indexer,segmenter,remover,vectorizer,rf,converts))
val rfModel = rfPipeline.fit(train)
val rfValiad = rfModel.transform(train)
val rfPredictions = rfModel.transform(test)
val accuracyRft = evaluator.evaluate(rfValiad)
println(s"Random forest validation-set accuracy = $accuracyRft")
val accuracyRfv = evaluator.evaluate(rfPredictions)
println(s"Random forest test-set accuracy = $accuracyRfv")
val rfEndTime = new Date().getTime
val rfCostTime = rfEndTime - rfStartTime
println(s"Random forest elapsed time: $rfCostTime ms")
6. Multilayer perceptron classifier code
The layer sizes of the multilayer perceptron classifier (a simple neural network) can be chosen with the following heuristic:
m: number of input-layer nodes; n: number of output-layer nodes; h1: first hidden layer size = log2(m); h2: second hidden layer size = sqrt(m+n) + a, where a is between 1 and 10.
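As a quick worked instance of this heuristic (plain Scala; the class count n = 12 is purely illustrative, whereas in the actual pipeline m is the CountVectorizer vocabulary size and n the number of distinct labels):

```scala
// Worked example of the layer-sizing heuristic above.
val m = 1000                                          // input-layer nodes (vocabulary size)
val n = 12                                            // output-layer nodes (hypothetical class count)
val h1 = Math.round(Math.log(m) / Math.log(2)).toInt  // log2(1000) ≈ 9.97, rounds to 10
val h2 = Math.round(Math.sqrt(m + n) + 1).toInt       // sqrt(1012) + 1 ≈ 32.8, rounds to 33 (a = 1)
val layers = Array(m, h1, h2, n)                      // Array(1000, 10, 33, 12)
```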
// multilayer perceptron classifier
val inputLayers = vectorizer.getVocabSize
val hideLayer1 = Math.round(Math.log(inputLayers)/Math.log(2)).toInt
val outputLayers = peopleNews.select("tab").distinct().count().toInt
val hideLayer2 = Math.round(Math.sqrt(inputLayers + outputLayers) + 1).toInt
val layers = Array[Int](inputLayers, hideLayer1, hideLayer2, outputLayers)
val mpcstartTime = new Date().getTime
val mpc = new MultilayerPerceptronClassifier()
.setLayers(layers)
.setBlockSize(128)
.setTol(1e-7)
.setMaxIter(100)
.setLabelCol("label")
.setFeaturesCol("features")
.setSeed(1234L)
val mpcPipeline = new Pipeline()
.setStages(Array(indexer,segmenter,remover,vectorizer,mpc,converts))
val mpcModel = mpcPipeline.fit(train)
val mpcValiad = mpcModel.transform(train)
val mpcPredictions = mpcModel.transform(test)
val accuracyMpct = evaluator.evaluate(mpcValiad)
println(s"Multilayer perceptron classifier validation-set accuracy = $accuracyMpct")
val accuracyMpcv = evaluator.evaluate(mpcPredictions)
println(s"Multilayer perceptron classifier test-set accuracy = $accuracyMpcv")
val mpcEndTime = new Date().getTime
val mpcCostTime = mpcEndTime - mpcstartTime
println(s"Multilayer perceptron classifier elapsed time: $mpcCostTime ms")
7. XGBoost code
// train the XGBoost model
val xgbParam = Map("eta" -> 0.1f,
"max_depth" -> 10, // maximum tree depth; default is 6, range [1,∞]
"objective" -> "multi:softprob", // learning task and objective
"num_class" -> outputLayers,
"num_round" -> 10,
"num_workers" -> 1)
val xgbStartTime = new Date().getTime
val xgb = new XGBoostClassifier(xgbParam).
setFeaturesCol("features").
setLabelCol("label")
val xgbPipeline = new Pipeline()
.setStages(Array(indexer,segmenter,remover,vectorizer,xgb,converts))
val xgbModel = xgbPipeline.fit(train)
val xgbValiad = xgbModel.transform(train)
val xgbPredictions = xgbModel.transform(test)
val accuracyXgbt = evaluator.evaluate(xgbValiad)
println(s"XGBoost validation-set accuracy = $accuracyXgbt")
val accuracyXgbv = evaluator.evaluate(xgbPredictions)
println(s"XGBoost test-set accuracy = $accuracyXgbv")
val xgbEndTime = new Date().getTime
val xgbCostTime = xgbEndTime - xgbStartTime
println(s"XGBoost elapsed time: $xgbCostTime ms")
8. Naive Bayes code
// naive Bayes classification
val nvbStartTime = new Date().getTime
val nvb = new NaiveBayes()
val nvbPipeline = new Pipeline()
.setStages(Array(indexer,segmenter,remover,vectorizer,nvb,converts))
val nvbModel = nvbPipeline.fit(train)
val nvbValiad = nvbModel.transform(train)
val nvbPredictions = nvbModel.transform(test)
val accuracyNvbt = evaluator.evaluate(nvbValiad)
println(s"Naive Bayes validation-set accuracy = $accuracyNvbt")
val accuracyNvbv = evaluator.evaluate(nvbPredictions)
println(s"Naive Bayes test-set accuracy = $accuracyNvbv")
val nvbEndTime = new Date().getTime
val nvbCostTime = nvbEndTime - nvbStartTime
println(s"Naive Bayes elapsed time: $nvbCostTime ms")
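Each of the six sections above brackets its work with the same new Date().getTime bookkeeping. A small helper like the following (a sketch, not part of the original post) centralizes the timing and returns both the result and the elapsed milliseconds:

```scala
// Measures wall-clock time of an arbitrary block of code.
// Uses System.currentTimeMillis, equivalent to new Date().getTime.
def timed[A](label: String)(body: => A): (A, Long) = {
  val start = System.currentTimeMillis()
  val result = body
  val elapsed = System.currentTimeMillis() - start
  println(s"$label elapsed time: $elapsed ms")
  (result, elapsed)
}

// Usage: wrap any expression, e.g. a pipeline fit/evaluate block.
val (sum, ms) = timed("sum 1..100") { (1 to 100).sum }
```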
IV. Performance Comparison
| | Validation accuracy | Test accuracy | Time (ms) |
| --- | --- | --- | --- |
| Logistic regression (LR) | 100% | 79.53% | 44697 |
| Decision tree (DT) | 81.58% | 73.68% | 34597 |
| Random forest (RF) | 94.24% | 73.68% | 56608 |
| Multilayer perceptron classifier (MPC) | 97.98% | 68.42% | 30801 |
| XGBoost | 99.71% | 79.53% | 31947 |
| Naive Bayes (NVB) | 83.74% | 71.34% | 11510 |
Parameter tuning for the algorithms above will be attempted in a follow-up.