Spark MLlib Regression Algorithms: Decision Trees
By 阿新 • Published: 2017-05-24
(一) Decision Tree Concepts
1, Comparing the decision tree algorithms (ID3, C4.5, CART):
1, When choosing the split attribute at the root and at internal nodes, ID3 uses information gain as its criterion. The drawback of information gain is that it favors attributes with many distinct values, which in some cases may not actually carry much useful information.
2, ID3 can only build decision trees over discrete-valued attributes; the other two algorithms can handle both discrete and continuous attributes.
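The information-gain criterion that ID3 uses can be sketched in a few lines of plain Scala. The toy counts below are illustrative only (they follow the shape of the classic 14-sample weather dataset) and are not from any dataset in this post:

```scala
// Sketch of ID3's information-gain criterion over toy label counts.
object InfoGainSketch {
  // Shannon entropy (base 2) of a discrete label distribution given as counts.
  def entropy(counts: Seq[Int]): Double = {
    val total = counts.sum.toDouble
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2)
    }.sum
  }

  // Information gain of a split: parent entropy minus the
  // size-weighted entropy of the child nodes.
  def infoGain(parent: Seq[Int], children: Seq[Seq[Int]]): Double = {
    val total = parent.sum.toDouble
    entropy(parent) - children.map(ch => ch.sum / total * entropy(ch)).sum
  }

  def main(args: Array[String]): Unit = {
    // Toy parent node: 14 samples, 9 positive / 5 negative.
    val parent = Seq(9, 5)
    // A hypothetical 3-way split of those samples.
    val children = Seq(Seq(2, 3), Seq(4, 0), Seq(3, 2))
    println(f"parent entropy = ${entropy(parent)}%.4f")
    println(f"information gain = ${infoGain(parent, children)}%.4f")
  }
}
```

The "favors many-valued attributes" problem mentioned above follows directly from this formula: splitting into many tiny children drives the weighted child entropy toward zero, which is why C4.5 divides the gain by the split's own entropy (gain ratio).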
2, A worked C4.5 example (see: http://m.blog.csdn.net/article/details?id=44726921)
C4.5 post-pruning strategy: mainly pessimistic pruning; see: http://www.cnblogs.com/zhangchaoyang/articles/2842490.html
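The pessimistic-pruning rule referenced above can be sketched as follows. This is a hedged reconstruction of the commonly described formulation (corrected error with a 0.5-per-leaf continuity penalty, plus a one-standard-error margin); exact details vary between presentations, and the toy numbers are made up:

```scala
// Sketch of the pessimistic-pruning decision rule used by C4.5-style trees.
object PessimisticPruneSketch {
  // Corrected error: misclassified count plus a 0.5 continuity penalty per leaf.
  def correctedError(misclassified: Int, leaves: Int): Double =
    misclassified + 0.5 * leaves

  // Prune when collapsing the subtree to a single leaf is no worse than the
  // subtree's corrected error plus one standard error of that error count.
  def shouldPrune(subtreeErrors: Int, subtreeLeaves: Int,
                  leafErrors: Int, n: Int): Boolean = {
    val e = correctedError(subtreeErrors, subtreeLeaves)
    val se = math.sqrt(e * (n - e) / n)
    correctedError(leafErrors, 1) <= e + se
  }

  def main(args: Array[String]): Unit = {
    // Toy numbers: a subtree with 3 leaves misclassifies 2 of 100 samples;
    // collapsing it to one leaf would misclassify 5.
    println(shouldPrune(subtreeErrors = 2, subtreeLeaves = 3, leafErrors = 5, n = 100))
  }
}
```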
(二) Applying Spark MLlib Decision Tree Regression
1, Dataset source and description: see http://www.cnblogs.com/ksWorld/p/6891664.html
2, Code implementation:
2.1 Building the input data format:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

val file_bike = "hour_nohead.csv"
val file_tree = sc.textFile(file_bike).map(_.split(",")).map { x =>
  // Middle columns are the features; the last column is the count label
  val feature = x.slice(2, x.length - 3).map(_.toDouble)
  val label = x(x.length - 1).toDouble
  LabeledPoint(label, Vectors.dense(feature))
}
println(file_tree.first())
// Empty map: treat every feature as continuous
val categoricalFeaturesInfo = Map[Int, Int]()
// impurity = "variance", maxDepth = 5, maxBins = 32
val model_DT = DecisionTree.trainRegressor(file_tree, categoricalFeaturesInfo, "variance", 5, 32)
2.2 Model evaluation metrics (MSE, MAE, RMSLE)
val predict_vs_train = file_tree.map { point =>
  (model_DT.predict(point.features), point.label)
  /* point => (math.exp(model_DT.predict(point.features)), math.exp(point.label)) */
}
predict_vs_train.take(5).foreach(println(_))
/* MSE: mean squared error */
val mse = predict_vs_train.map(x => math.pow(x._1 - x._2, 2)).mean()
/* MAE: mean absolute error */
val mae = predict_vs_train.map(x => math.abs(x._1 - x._2)).mean()
/* RMSLE: root mean squared log error */
val rmsle = math.sqrt(predict_vs_train.map(x => math.pow(math.log(x._1 + 1) - math.log(x._2 + 1), 2)).mean())
println(s"mse is $mse and mae is $mae and rmsle is $rmsle")
/* mse is 11611.485999495755 and mae is 71.15018786490428 and rmsle is 0.6251152586960916 */
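The three metrics can be checked without Spark. Below is a plain-Scala sketch over a hand-made list of (prediction, label) pairs; the numbers are illustrative, not from the bike-sharing data:

```scala
// Plain-Scala versions of the three metrics computed above with Spark RDDs.
object RegressionMetricsSketch {
  def mse(pairs: Seq[(Double, Double)]): Double =
    pairs.map { case (p, y) => math.pow(p - y, 2) }.sum / pairs.length

  def mae(pairs: Seq[(Double, Double)]): Double =
    pairs.map { case (p, y) => math.abs(p - y) }.sum / pairs.length

  // RMSLE uses log(x + 1) so that zero-valued counts stay well defined,
  // and it penalizes relative rather than absolute error.
  def rmsle(pairs: Seq[(Double, Double)]): Double =
    math.sqrt(pairs.map { case (p, y) =>
      math.pow(math.log(p + 1) - math.log(y + 1), 2)
    }.sum / pairs.length)

  def main(args: Array[String]): Unit = {
    val pairs = Seq((10.0, 12.0), (100.0, 90.0), (3.0, 3.0))
    println(f"mse = ${mse(pairs)}%.4f, mae = ${mae(pairs)}%.4f, rmsle = ${rmsle(pairs)}%.4f")
  }
}
```

RMSLE is the natural headline metric for count targets like hourly bike rentals, since being off by 10 rentals matters much more at a true count of 20 than at 200.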
(三) Improving Model Performance and Tuning Parameters
1, Transform the target variable (take the logarithm of the target value) by modifying the statements below:
LabeledPoint(math.log(label), Vectors.dense(feature))

and

val predict_vs_train = file_tree.map {
  /* point => (model_DT.predict(point.features), point.label) */
  point => (math.exp(model_DT.predict(point.features)), math.exp(point.label))
}
/* Results: mse is 14781.575988339053 and mae is 76.41310991122032 and rmsle is 0.6405996100717035 */
The decision tree's performance actually got worse after the log transformation.
2, Model parameter tuning
1, Build the training and test sets
val file_tree = sc.textFile(file_bike).map(_.split(",")).map { x =>
  val feature = x.slice(2, x.length - 3).map(_.toDouble)
  val label = x(x.length - 1).toDouble
  LabeledPoint(label, Vectors.dense(feature))
  /* LabeledPoint(math.log(label), Vectors.dense(feature)) */
}
val tree_orgin = file_tree.randomSplit(Array(0.8, 0.2), 11L)
val tree_train = tree_orgin(0)
val tree_test = tree_orgin(1)
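Note that randomSplit assigns each record independently at random, so the 80/20 split above is only approximate (the weights are expectations, not exact sizes). A plain-Scala sketch of this behavior (not Spark's exact implementation, which samples per-partition):

```scala
import scala.util.Random

// Sketch of an approximate 80/20 random split with a fixed seed,
// mirroring the shape of RDD.randomSplit(Array(0.8, 0.2), 11L).
object SplitSketch {
  def split[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    // Each record independently draws a uniform number and lands in
    // the training set when the draw is below the training fraction.
    data.partition(_ => rng.nextDouble() < trainFraction)
  }

  def main(args: Array[String]): Unit = {
    val (train, test) = split(1 to 1000, 0.8, 11L)
    println(s"train = ${train.size}, test = ${test.size}")
  }
}
```

The fixed seed (11L above) makes the split reproducible across runs, which matters when comparing the parameter sweeps below on the same test set.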
2, Tuning the tree depth
val categoricalFeaturesInfo = Map[Int, Int]()
/* Sweep the tree depth */
val Deep_Results = Seq(1, 2, 3, 4, 5, 10, 20).map { param =>
  val model = DecisionTree.trainRegressor(tree_train, categoricalFeaturesInfo, "variance", param, 32)
  val scoreAndLabels = tree_test.map { point =>
    (model.predict(point.features), point.label)
  }
  // Note: this RMSLE omits the +1 offset used earlier
  val rmsle = math.sqrt(scoreAndLabels.map(x => math.pow(math.log(x._1) - math.log(x._2), 2)).mean)
  (s"$param depth", rmsle)
}
/* Depth sweep results */
Deep_Results.foreach { case (param, rmsl) => println(f"$param, rmsle = ${rmsl}") }
/*
1 depth, rmsle = 1.0763369409492645
2 depth, rmsle = 0.9735820606349874
3 depth, rmsle = 0.8786984993014815
4 depth, rmsle = 0.8052113493915528
5 depth, rmsle = 0.7014036913077335
10 depth, rmsle = 0.44747906135994925
20 depth, rmsle = 0.4769214752638845
*/
Deeper decision trees overfit; from these results, the optimal tree depth for this dataset is around 10.
3, Tuning the number of bins
/* Sweep the number of bins */
val ClassNum_Results = Seq(2, 4, 8, 16, 32, 64, 100).map { param =>
  val model = DecisionTree.trainRegressor(tree_train, categoricalFeaturesInfo, "variance", 10, param)
  val scoreAndLabels = tree_test.map { point =>
    (model.predict(point.features), point.label)
  }
  val rmsle = math.sqrt(scoreAndLabels.map(x => math.pow(math.log(x._1) - math.log(x._2), 2)).mean)
  (s"$param bins", rmsle)
}
/* Bin sweep results */
ClassNum_Results.foreach { case (param, rmsl) => println(f"$param, rmsle = ${rmsl}") }
/*
2 bins, rmsle = 1.2995002615220668
4 bins, rmsle = 0.7682777577495858
8 bins, rmsle = 0.6615110909041817
16 bins, rmsle = 0.4981237727958235
32 bins, rmsle = 0.44747906135994925
64 bins, rmsle = 0.4487531073836407
100 bins, rmsle = 0.4487531073836407
*/
More bins make the model more complex and can help performance when the feature dimensionality is high. Beyond a certain point, though, extra bins contribute little, and overfitting can even make test-set performance worse. For this dataset, around 32 bins is sufficient.
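Pulling together the sweep results reported above (rounded to four decimal places), a tiny helper can pick the best setting from each sweep:

```scala
// Select the lowest-RMSLE parameter from the sweep results printed above.
object BestParamSketch {
  def best(results: Seq[(Int, Double)]): (Int, Double) = results.minBy(_._2)

  def main(args: Array[String]): Unit = {
    // Depth sweep RMSLEs from the article, rounded.
    val depthResults = Seq(1 -> 1.0763, 2 -> 0.9736, 3 -> 0.8787,
      4 -> 0.8052, 5 -> 0.7014, 10 -> 0.4475, 20 -> 0.4769)
    // Bin sweep RMSLEs from the article, rounded.
    val binResults = Seq(2 -> 1.2995, 4 -> 0.7683, 8 -> 0.6615,
      16 -> 0.4981, 32 -> 0.4475, 64 -> 0.4488, 100 -> 0.4488)
    println(s"best depth = ${best(depthResults)}")
    println(s"best bins  = ${best(binResults)}")
  }
}
```

Both sweeps were run one parameter at a time with the other held fixed (depth swept at 32 bins, bins swept at depth 10); a joint grid search could in principle find a slightly different optimum.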