# Spark - ML Tuning
Official documentation: https://spark.apache.org/docs/2.2.0/ml-tuning.html
This section covers how to tune the hyperparameters of ML algorithms and pipelines using MLlib's tooling; the built-in cross-validation and related utilities let users optimize hyperparameters for individual models as well as entire pipelines.
Contents:
- Model selection, i.e., hyperparameter tuning
- Cross-validation
- Train-validation split
### Model Selection (Hyperparameter Tuning)
An important task in machine learning is model selection: using data to find the best model and parameters for a given task, also known as tuning. Tuning can be applied to a single Estimator or to an entire Pipeline, covering all of its stages at once, so users can tune the whole pipeline in one go rather than tuning each stage separately.
MLlib supports model-selection tools such as CrossValidator and TrainValidationSplit, which require the following items:
- Estimator: the algorithm or Pipeline to tune;
- A list of ParamMaps: the parameter space to search over;
- Evaluator: a metric that measures how well a fitted model performs on held-out test data;
These tools work as follows (a rough sketch of this loop appears right after the list):
- split the input data into separate training and test sets;
- for each (training, test) pair, iterate over every parameter combination in the search space:
- for each combination, fit the Estimator with those parameters, obtain the resulting model, and evaluate its performance;
- select the parameter combination whose model performed best.
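Conceptually, the selection loop looks like the rough pure-Python sketch below; this is an illustration of the procedure, not Spark's actual implementation. It leans on two pieces of the PySpark API: `Estimator.fit` accepts an optional ParamMap, and `Evaluator.isLargerBetter()` says which direction of the metric is better.
```python
def select_model(estimator, param_maps, evaluator, train, test):
    """Sketch: fit one model per ParamMap and return the best-scoring one."""
    scored = []
    for params in param_maps:                 # every combination in the grid
        model = estimator.fit(train, params)  # fit with this parameter setting
        metric = evaluator.evaluate(model.transform(test))
        scored.append((metric, model))
    # Some metrics improve upward (areaUnderROC), others downward (RMSE).
    pick = max if evaluator.isLargerBetter() else min
    return pick(scored, key=lambda pair: pair[0])[1]
```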
For regression problems the Evaluator is a RegressionEvaluator, for binary data a BinaryClassificationEvaluator, and for multiclass problems a MulticlassClassificationEvaluator; the default metric used by each evaluator can be changed with setMetricName.
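For instance, each evaluator ships with a default metric that setMetricName overrides; a minimal sketch (metric names as listed in the Spark 2.2 API docs):
```python
from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator,
                                   RegressionEvaluator)

# Defaults are areaUnderROC, f1, and rmse respectively; setMetricName overrides them.
binary_eval = BinaryClassificationEvaluator().setMetricName("areaUnderPR")
multi_eval = MulticlassClassificationEvaluator().setMetricName("accuracy")
reg_eval = RegressionEvaluator().setMetricName("mae")
```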
### Cross-Validation
CrossValidator begins by splitting the dataset into a set of folds, which serve as separate training and test sets. With k=3 folds, CrossValidator generates 3 (training, test) pairs, each using 2 of the folds for training and the remaining fold for testing. To evaluate a particular parameter combination, CrossValidator computes the average performance of the 3 models produced by fitting on the 3 different training sets.
Once the best parameter combination has been identified, CrossValidator refits the Estimator on the full dataset with those parameters to produce the final model used for prediction.
Example: model selection via cross-validation.
Note that cross-validation over a full grid of parameters is expensive. In the example below, the grid has 3 values for numFeatures and 2 values for regParam, and CrossValidator uses 2 folds, so 3 × 2 × 2 = 12 different models are trained. In real settings it is common to use more parameters, more values per parameter, and more folds; in other words, CrossValidator is inherently costly. Even so, it remains a more principled and automated way to choose parameters than hand tuning.
```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
# Prepare training documents, which are labeled.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice
# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)
# Prepare test documents, which are unlabeled.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])
# Make predictions on test documents. cvModel uses the best model found (lrModel).
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)
```
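After fitting, it helps to check which grid point won. A short sketch using the `avgMetrics` list on CrossValidatorModel (one averaged metric per ParamMap, in grid order) and the fitted pipeline stages:
```python
# Average cross-validated areaUnderROC for every parameter combination.
for params, metric in zip(paramGrid, cvModel.avgMetrics):
    print({p.name: v for p, v in params.items()}, "->", metric)

# bestModel is a fitted PipelineModel; stage 1 is the HashingTF transformer.
print("best numFeatures:", cvModel.bestModel.stages[1].getNumFeatures())
```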
### Train-Validation Split
For hyperparameter tuning, Spark also supports TrainValidationSplit, which evaluates each parameter combination only once, as opposed to the k evaluations performed by CrossValidator. It is therefore much cheaper, but it will not produce as reliable results when the training dataset is not sufficiently large.
Unlike CrossValidator, TrainValidationSplit creates a single (training, test) pair, splitting the data into two parts according to trainRatio; with trainRatio=0.75, 75% of the data is used for training and 25% for validation.
Like CrossValidator, TrainValidationSplit ultimately fits a final estimator using the best parameter combination and the full dataset.
```python
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
# Prepare training and test data.
data = spark.read.format("libsvm")\
    .load("data/mllib/sample_linear_regression_data.txt")
train, test = data.randomSplit([0.9, 0.1], seed=12345)
lr = LinearRegression(maxIter=10)
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# TrainValidationSplit will try all combinations of values and determine best model using
# the evaluator.
paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.fitIntercept, [False, True])\
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
    .build()
# In this case the estimator is simply the linear regression.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)
# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(train)
# Make predictions on test data. model is the model with combination of parameters
# that performed best.
model.transform(test)\
    .select("features", "label", "prediction")\
    .show()
```
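As with cross-validation, it is often worth inspecting the fitted result; a brief sketch, assuming the `validationMetrics` and `bestModel` attributes that TrainValidationSplitModel exposes in the PySpark tuning API:
```python
# One validation metric per ParamMap (RMSE here, the RegressionEvaluator
# default, so smaller is better).
for params, metric in zip(paramGrid, model.validationMetrics):
    print({p.name: v for p, v in params.items()}, "->", metric)

# bestModel is the LinearRegressionModel refit with the winning parameters.
best = model.bestModel
print("intercept:", best.intercept)
print("coefficients:", best.coefficients)
```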