Spark Machine Learning: Model Selection and Hyperparameter Tuning

阿新 • Published: 2019-02-09

Model selection (hyperparameter tuning)

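An important task in ML is model selection: using data to find the best model or parameters for a given task. This is also called tuning. Tuning may be done for an individual Estimator such as LogisticRegression, or for an entire Pipeline that includes multiple algorithms, featurization, and other steps. Users can tune an entire Pipeline at once rather than tuning each element of the Pipeline separately.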

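MLlib supports model selection using tools such as CrossValidator and TrainValidationSplit. These tools require the following items:

Estimator: the algorithm or Pipeline to tune
Set of ParamMaps: the parameters to choose from, sometimes called a "parameter grid" to search over
Evaluator: the metric that measures how well a fitted Model does on held-out test data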

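At a high level, these model selection tools work as follows:

They split the input data into separate training and test datasets.
For each (training, test) pair, they iterate through the set of ParamMaps:
For each ParamMap, they fit the Estimator using those parameters, get the fitted Model, and evaluate the Model's performance using the Evaluator.
They select the Model produced by the best-performing set of parameters.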
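
The steps above can be sketched in plain Python, without Spark. This is only an illustration of the search loop, not MLlib's implementation: `select_best`, `fit_shrunken_mean`, `mean_squared_error`, and the toy data are hypothetical names and values introduced here, and the "model" is just a single number.

```python
from itertools import product


def select_best(fit, evaluate, splits, param_grid, larger_is_better=True):
    """Return (best_params, average_metric) for a grid of parameters.

    fit(train, params) -> model and evaluate(model, test) -> float are
    supplied by the caller; splits is a list of (train, test) pairs.
    """
    # Expand {"name": [values, ...]} into a list of parameter dicts,
    # playing the role of the set of ParamMaps.
    names = list(param_grid)
    param_maps = [dict(zip(names, combo)) for combo in product(*param_grid.values())]

    # For each (train, test) pair, fit and evaluate every parameter setting.
    totals = [0.0] * len(param_maps)
    for train, test in splits:
        for i, params in enumerate(param_maps):
            model = fit(train, params)
            totals[i] += evaluate(model, test)

    # Average the metric over the pairs and pick the best setting.
    averages = [total / len(splits) for total in totals]
    choose = max if larger_is_better else min
    best_index = choose(range(len(param_maps)), key=lambda i: averages[i])
    return param_maps[best_index], averages[best_index]


# Toy problem (hypothetical): the "model" is the training mean, shrunk toward 0.
def fit_shrunken_mean(train, params):
    return (1.0 - params["shrink"]) * sum(train) / len(train)


def mean_squared_error(model, test):
    return sum((y - model) ** 2 for y in test) / len(test)


splits = [([1.0, 2.0, 3.0], [2.0, 2.0]), ([2.0, 2.0], [2.0])]
best_params, best_metric = select_best(
    fit_shrunken_mean, mean_squared_error, splits,
    {"shrink": [0.0, 0.5]}, larger_is_better=False)
print(best_params, best_metric)
```

The `larger_is_better` flag is needed because some metrics (such as an error) should be minimized while others (such as an area under a curve) should be maximized.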

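The Evaluator can be a RegressionEvaluator for regression problems, a BinaryClassificationEvaluator for binary data, or a MulticlassClassificationEvaluator for multiclass problems. The default metric used to choose the best ParamMap can be overridden with the setMetricName method of each of these evaluators. To help construct the parameter grid, users can use the ParamGridBuilder utility.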

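Cross-Validation
CrossValidator begins by splitting the dataset into a set of folds, which are used as separate training and test datasets. For example, with k = 3 folds, CrossValidator generates 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. To evaluate a particular ParamMap, CrossValidator computes the average evaluation metric over the 3 Models produced by fitting the Estimator on the 3 different (training, test) dataset pairs.
After identifying the best ParamMap, CrossValidator finally re-fits the Estimator using the best ParamMap and the entire dataset.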
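
To make the fold arithmetic concrete, here is a small sketch in plain Python of how k folds turn into k (training, test) pairs. The helper `k_fold_pairs` is hypothetical, and the round-robin fold assignment is an assumption chosen to keep the output deterministic; CrossValidator assigns rows to folds itself, so real fold sizes may differ slightly.

```python
def k_fold_pairs(rows, k):
    """Split rows into k folds and return k (train, test) pairs."""
    # Assumption for illustration: assign rows to folds round-robin.
    folds = [rows[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        # Training data is every fold except the one held out for testing.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        pairs.append((train, test))
    return pairs


rows = list(range(12))
pairs = k_fold_pairs(rows, k=3)
for train, test in pairs:
    # With 12 rows and k = 3, each pair trains on 8 rows and tests on 4.
    print(len(train), len(test))
```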

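Example: model selection via cross-validation
The following example demonstrates using CrossValidator to select from a grid of parameters.
Note that cross-validation over a grid of parameters is expensive. In the example below, the parameter grid has 3 values for hashingTF.numFeatures and 2 values for lr.regParam, and CrossValidator uses 2 folds. This multiplies out to (3×2)×2 = 12 different models being trained. In realistic settings, it is common to try many more parameters and to use more folds (k = 3 and k = 10 are common). In other words, using CrossValidator can be very expensive. However, it is also a well-established method for choosing parameters that is more statistically sound than heuristic hand-tuning.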

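from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession that the examples refer to as `spark`.
spark = SparkSession.builder.appName("ModelSelectionExample").getOrCreate()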

# Prepare training documents, which are labeled.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

# Prepare test documents, which are unlabeled.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents. cvModel uses the best model found during the search.
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)
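
The (3×2)×2 = 12 figure quoted for the example above can be checked with a few lines of plain Python. This sketch only enumerates the combinations; it does not call Spark, and the dictionary keys are labels chosen here for readability.

```python
from itertools import product

num_features_values = [10, 100, 1000]  # values for hashingTF.numFeatures
reg_param_values = [0.1, 0.01]         # values for lr.regParam
num_folds = 2

# Every combination of the two parameters is one grid point.
grid = [
    {"numFeatures": n, "regParam": r}
    for n, r in product(num_features_values, reg_param_values)
]

# Each grid point is fitted once per fold during the search.
fits_during_search = len(grid) * num_folds

# The final re-fit of the best setting on the entire dataset adds one more fit.
total_fits = fits_during_search + 1

print(len(grid), fits_during_search, total_fits)
```

Adding a third parameter with m values multiplies the search cost by m, which is why grids grow expensive quickly.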

Train-Validation Split

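In addition to CrossValidator, Spark also offers TrainValidationSplit for hyperparameter tuning. TrainValidationSplit evaluates each combination of parameters only once, as opposed to k times in the case of CrossValidator. It is therefore less expensive, but it will not produce results as reliable when the training dataset is not sufficiently large.
Unlike CrossValidator, TrainValidationSplit creates a single (training, test) dataset pair. It splits the dataset into these two parts using the trainRatio parameter. For example, with trainRatio = 0.75, TrainValidationSplit generates a training and test dataset pair in which 75% of the data is used for training and 25% for validation.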
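
As a rough illustration of what trainRatio means, the sketch below splits a list of rows once in plain Python. The helper `train_validation_split` and its `seed` argument are hypothetical; this version cuts at an exact index, whereas a randomized split of a distributed dataset will generally hit the ratio only approximately.

```python
import random


def train_validation_split(rows, train_ratio, seed=0):
    """Shuffle rows once and cut them into (train, validation) parts."""
    shuffled = rows[:]                      # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))
train, validation = train_validation_split(rows, train_ratio=0.75)
# 75 rows go to training and 25 to validation; each row lands in exactly one part.
print(len(train), len(validation))
```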

Example: model selection via train validation split

from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# Prepare training and test data.
data = spark.read.format("libsvm")\
    .load("data/mllib/sample_linear_regression_data.txt")
train, test = data.randomSplit([0.9, 0.1], seed=12345)

lr = LinearRegression(maxIter=10)

# We use a ParamGridBuilder to construct a grid of parameters to search over.
# TrainValidationSplit will try all combinations of values and determine best model using
# the evaluator.
paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.1, 0.01])\
    .addGrid(lr.fitIntercept, [False, True])\
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
    .build()

# In this case the estimator is simply the linear regression.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
tvs = TrainValidationSplit(estimator=lr,
                           estimatorParamMaps=paramGrid,
                           evaluator=RegressionEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)

# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(train)

# Make predictions on test data. model is the model with combination of parameters
# that performed best.
model.transform(test)\
    .select("features", "label", "prediction")\
    .show()
