
Using Spark ML Offline-Trained Models for Online Prediction

Our company recently needed to take algorithm models trained offline and serve them online for real-time prediction. Leaving aside online feature engineering, our investigation found that the combination of jpmml-sparkml and jpmml-evaluator meets the requirement. Note, however, that the framework is released under the AGPL-3.0 license.

Solution: Spark ML + jpmml-sparkml + jpmml-evaluator
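Both JPMML libraries are published to Maven Central. A minimal sbt sketch of the dependencies (the version numbers here are assumptions; pick the jpmml-sparkml release that matches your Spark version):

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-mllib"    % "2.1.0" % "provided", // assumed Spark version
  "org.jpmml"        %  "jpmml-sparkml"  % "1.2.7",              // PipelineModel -> PMML converter
  "org.jpmml"        %  "pmml-evaluator" % "1.3.10"              // PMML scoring at serving time
)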

Training a Random Forest model offline with Spark and saving it in PMML format:

import javax.xml.transform.stream.StreamResult
import com.jd.risk.utils.HadoopFileUtil
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.examples.ml.DecisionTreeExample
import org.apache.spark.examples.mllib.AbstractParams
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler, VectorIndexer}
import org.apache.spark.ml.regression.{RandomForestRegressionModel, RandomForestRegressor}
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.jpmml.model.{JAXBUtil, MetroJAXBUtil}
import org.jpmml.sparkml.ConverterUtil
import scopt.OptionParser
import scala.collection.mutable
import scala.language.reflectiveCalls
/**
  * Created by sjmei on 2017/01/19.
  */
object RandomForestPMMLTask {
  case class Params(
      input: String = null,
      modelDir: String = null,
      taskType:String = "train",
      testInput: String = "",
      dataFormat: String = "libsvm",
      algo: String = "classification",
      maxDepth: Int = 4,
      maxBins: Int = 32,
      minInstancesPerNode: Int = 1,
      minInfoGain: Double = 0.0,
      numTrees: Int = 5,
      featureSubsetStrategy: String = "auto",
      fracTest: Double = 0.2,
      cacheNodeIds: Boolean = false,
      checkpointDir: Option[String] = None,
      checkpointInterval: Int = 10) extends AbstractParams[Params]
  def main(args: Array[String]) {
    val defaultParams = Params()
    val parser = new OptionParser[Params]("RandomForestExample") {
      head("RandomForestExample: an example random forest app.")
      opt[String]("algo")
        .text(s"algorithm (classification, regression), default: ${defaultParams.algo}")
        .action((x, c) => c.copy(algo = x))
      opt[String]("taskType")
        .text(s"modelType, default: ${defaultParams.taskType}")
        .action((x, c) => c.copy(taskType = x))
      opt[Int]("maxDepth")
        .text(s"max depth of the tree, default: ${defaultParams.maxDepth}")
        .action((x, c) => c.copy(maxDepth = x))
      opt[Int]("maxBins")
        .text(s"max number of bins, default: ${defaultParams.maxBins}")
        .action((x, c) => c.copy(maxBins = x))
      opt[Int]("minInstancesPerNode")
        .text(s"min number of instances required at child nodes to create the parent split," +
        s" default: ${defaultParams.minInstancesPerNode}")
        .action((x, c) => c.copy(minInstancesPerNode = x))
      opt[Double]("minInfoGain")
        .text(s"min info gain required to create a split, default: ${defaultParams.minInfoGain}")
        .action((x, c) => c.copy(minInfoGain = x))
      opt[Int]("numTrees")
        .text(s"number of trees in ensemble, default: ${defaultParams.numTrees}")
        .action((x, c) => c.copy(numTrees = x))
      opt[String]("featureSubsetStrategy")
        .text(s"number of features to use per node (supported:" +
        s" ${RandomForestClassifier.supportedFeatureSubsetStrategies.mkString(",")})," +
        s" default: ${defaultParams.numTrees}")
        .action((x, c) => c.copy(featureSubsetStrategy = x))
      opt[Double]("fracTest")
        .text(s"fraction of data to hold out for testing. If given option testInput, " +
        s"this option is ignored. default: ${defaultParams.fracTest}")
        .action((x, c) => c.copy(fracTest = x))
      opt[Boolean]("cacheNodeIds")
        .text(s"whether to use node Id cache during training, " +
        s"default: ${defaultParams.cacheNodeIds}")
        .action((x, c) => c.copy(cacheNodeIds = x))
      opt[String]("checkpointDir")
        .text(s"checkpoint directory where intermediate node Id caches will be stored, " +
        s"default: ${
          defaultParams.checkpointDir match {
            case Some(strVal) => strVal
            case None => "None"
          }
        }")
        .action((x, c) => c.copy(checkpointDir = Some(x)))
      opt[Int]("checkpointInterval")
        .text(s"how often to checkpoint the node Id cache, " +
        s"default: ${defaultParams.checkpointInterval}")
        .action((x, c) => c.copy(checkpointInterval = x))
      opt[String]("testInput")
        .text(s"input path to test dataset. If given, option fracTest is ignored." +
        s" default: ${defaultParams.testInput}")
        .action((x, c) => c.copy(testInput = x))
      opt[String]("dataFormat")
        .text("data format: libsvm (default), dense (deprecated in Spark v1.1)")
        .action((x, c) => c.copy(dataFormat = x))
      arg[String]("<input>")
        .text("input path to labeled examples")
        .required()
        .action((x, c) => c.copy(input = x))
      arg[String]("<modelDir>")
        .text("modelDir path to labeled examples")
        .required()
        .action((x, c) => c.copy(modelDir = x))
      checkConfig { params =>
        if (params.fracTest < 0 || params.fracTest >= 1) {
          failure(s"fracTest ${params.fracTest} value incorrect; should be in [0,1).")
        } else {
          success
        }
      }
    }
    parser.parse(args, defaultParams) match {
      case Some(params) => {
        if(params.taskType.equalsIgnoreCase("train")){
          train(params)
        }
      }
      case _ => sys.exit(1)
    }
  }
  def train(params: Params): Unit = {
    val spark = SparkSession
      .builder
      .master("local")
      .appName(s"RandomForestExample with $params")
      .getOrCreate()
    params.checkpointDir.foreach(spark.sparkContext.setCheckpointDir)
    val algo = params.algo.toLowerCase
    println(s"RandomForestExample with parameters:\n$params")
    // Load training and test data and cache it (AlgoUtils.loadMaliceDataFrame is a
    // project-internal helper that loads the input and splits it into train/test).
    val (training: DataFrame, test: DataFrame) = AlgoUtils.loadMaliceDataFrame(spark.sparkContext, params.input, params.fracTest)
    // Set up Pipeline.
    val stages = new mutable.ArrayBuffer[PipelineStage]()
    // (1) For classification, re-index classes.
    val labelColName = if (algo == "classification") "indexedLabel" else "label"
    if (algo == "classification") {
      val labelIndexer = new StringIndexer()
        .setInputCol("label")
        .setOutputCol(labelColName)
      stages += labelIndexer
    }
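    // NOTE: the online scoring side must supply these eight features in exactly this order.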
    val vectorAssembler = new VectorAssembler()
    vectorAssembler.setInputCols(Array("degree","tcNum","pageRank","commVertexNum","normQ","gtRate","eqRate","ltRate"))
    vectorAssembler.setOutputCol("features")
    stages += vectorAssembler
    // (2) Identify categorical features using VectorIndexer.
    //     Features with more than maxCategories values will be treated as continuous.
    //     Note: the forest below reads the raw "features" column, so the
    //     "indexedFeatures" output of this stage is currently unused.
    val featuresIndexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(10)
    stages += featuresIndexer
    // (3) Learn Random Forest.
    val dt = algo match {
      case "classification" =>
        new RandomForestClassifier()
          .setFeaturesCol("features")
          .setLabelCol(labelColName)
          .setMaxDepth(params.maxDepth)
          .setMaxBins(params.maxBins)
          .setMinInstancesPerNode(params.minInstancesPerNode)
          .setMinInfoGain(params.minInfoGain)
          .setCacheNodeIds(params.cacheNodeIds)
          .setCheckpointInterval(params.checkpointInterval)
          .setFeatureSubsetStrategy(params.featureSubsetStrategy)
          .setNumTrees(params.numTrees)
      case "regression" =>
        new RandomForestRegressor()
          .setFeaturesCol("features")
          .setLabelCol(labelColName)
          .setMaxDepth(params.maxDepth)
          .setMaxBins(params.maxBins)
          .setMinInstancesPerNode(params.minInstancesPerNode)
          .setMinInfoGain(params.minInfoGain)
          .setCacheNodeIds(params.cacheNodeIds)
          .setCheckpointInterval(params.checkpointInterval)
          .setFeatureSubsetStrategy(params.featureSubsetStrategy)
          .setNumTrees(params.numTrees)
      case _ => throw new IllegalArgumentException(s"Algo ${params.algo} not supported.")
    }
    stages += dt
    val pipeline = new Pipeline().setStages(stages.toArray)
    // Fit the Pipeline.
    val startTime = System.nanoTime()
    val pipelineModel = pipeline.fit(training)
    val elapsedTime = (System.nanoTime() - startTime) / 1e9
    println(s"Training time: $elapsedTime seconds")
    /**
      * Convert the fitted pipeline to PMML and write it to HDFS.
      */
    val modelPmmlPath = params.modelDir
    val pmml = ConverterUtil.toPMML(training.schema, pipelineModel)
    val conf = new Configuration()
    HadoopFileUtil.deleteFile(modelPmmlPath)
    val path = new Path(modelPmmlPath)
    val fs = path.getFileSystem(conf)
    val out = fs.create(path)
    try {
      MetroJAXBUtil.marshalPMML(pmml, out)
    } finally {
      out.close()
    }
    // Echo the PMML document to stdout for inspection.
    JAXBUtil.marshalPMML(pmml, new StreamResult(System.out))
    val predictions = pipelineModel.transform(training)
    // Get the trained Random Forest from the fitted PipelineModel.
    algo match {
      case "classification" =>
        val rfModel = pipelineModel.stages.last.asInstanceOf[RandomForestClassificationModel]
        if (rfModel.totalNumNodes < 30) {
          println(rfModel.toDebugString) // Print full model.
        } else {
          println(rfModel) // Print model summary.
        }
      case "regression" =>
        val rfrModel = pipelineModel.stages.last.asInstanceOf[RandomForestRegressionModel]
        if (rfrModel.totalNumNodes < 30) {
          println(rfrModel.toDebugString) // Print full model.
        } else {
          println(rfrModel) // Print model summary.
        }
      case _ => throw new IllegalArgumentException(s"Algo ${params.algo} not supported.")
    }
    // Evaluate model on training, test data.
    algo match {
      case "classification" =>
        println("Training data results:")
        DecisionTreeExample.evaluateClassificationModel(pipelineModel, training, labelColName)
        val evaluator = new MulticlassClassificationEvaluator()
          .setLabelCol("indexedLabel")
          .setPredictionCol("prediction")
          .setMetricName("accuracy")
        val accuracy = evaluator.evaluate(predictions)
        println("Test Error = " + (1.0 - accuracy))
      case "regression" =>
        println("Training data results:")
        DecisionTreeExample.evaluateRegressionModel(pipelineModel, training, labelColName)
      case _ =>
        throw new IllegalArgumentException(s"Algo ${params.algo} not supported.")
    }
    predictions.printSchema()
    if (algo == "classification") {
      predictions.select("label", "prediction", "probability").show(10)
    } else {
      predictions.select("label", "prediction").show(10)
    }
    spark.stop()
  }
}
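Since the session is created with .master("local"), the task can be smoke-tested by invoking it directly with the two required positional arguments after any options; the paths below are hypothetical:

RandomForestPMMLTask.main(Array(
  "--algo", "classification",
  "--numTrees", "5",
  "hdfs:///tmp/demo/train_data",     // <input>: labeled examples
  "hdfs:///tmp/demo/model/rf.pmml"   // <modelDir>: PMML output path
))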

Online real-time prediction with jpmml-evaluator. The scoring side must supply the same eight features, in the same order, that the training-side VectorAssembler consumed:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.dmg.pmml.FieldName;
import org.dmg.pmml.PMML;
import org.jpmml.evaluator.Evaluator;
import org.jpmml.evaluator.FieldValue;
import org.jpmml.evaluator.InputField;
import org.jpmml.evaluator.ModelEvaluatorFactory;
import org.jpmml.evaluator.ProbabilityDistribution;
import org.jpmml.evaluator.TargetField;
import org.jpmml.model.PMMLUtil;
/**
 * Created by sjmei on 2017/1/19.
 */
public class PredictScore {
    public static void main(String[] args) throws Exception {
        PMML pmml = readPMML(new File("data/pmmlmodel/rf.pmml"));
        ModelEvaluatorFactory modelEvaluatorFactory = ModelEvaluatorFactory.newInstance();
//        System.out.println(pmml.getModels().get(0));
        Evaluator evaluator = modelEvaluatorFactory.newModelEvaluator(pmml);
//        ModelEvaluator evaluator = new MiningModelEvaluator(pmml);
        evaluator.verify();
        List<InputField> inputFields = evaluator.getInputFields();
        InputStream is = new FileInputStream(new File("data/train.txt"));
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        String line;
        int diffDelta = 0;
        int sameDelta = 0;
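        // Each line of data/train.txt is tab-separated: column 14 holds the target flag and
        // column 2 a risk score, from which the expected label is derived for comparison.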
        while((line = br.readLine()) != null) {
            String[] splits = line.split("\t",-1);
            double targetMs = transToDouble(splits[14]);
            double risk_value = transToDouble(splits[2]);
            double label = 0.0;
            if(targetMs==1.0 && risk_value >5.0d){
                label = 1.0;
            }
            LinkedHashMap<FieldName, FieldValue> arguments = readArgumentsFromLine(splits, inputFields);
            Map<FieldName, ?> results = evaluator.evaluate(arguments);
            List<TargetField> targetFields = evaluator.getTargetFields();
            for(TargetField targetField : targetFields){
                FieldName targetFieldName = targetField.getName();
                Object targetFieldValue = results.get(targetFieldName);
                ProbabilityDistribution distribution = (ProbabilityDistribution)targetFieldValue;
                Object result = distribution.getResult();
                if(label == Double.valueOf(result.toString())){
                    sameDelta +=1;
                }else{
                    diffDelta +=1;
                }
            }
        }
        br.close();
        System.out.println("acc count:"+sameDelta);
        System.out.println("error count:"+diffDelta);
        System.out.println("acc rate:"+(sameDelta*1.0d/(sameDelta+diffDelta)));
    }
    /**
     * Read the PMML model from a file.
     * @param file the PMML model file
     * @return the parsed PMML object
     * @throws Exception if the file cannot be read or parsed
     */
    public static PMML readPMML(File file) throws Exception {
        InputStream is = new FileInputStream(file);
        return PMMLUtil.unmarshal(is);
    }
    /**
     * Build the model input fields from one raw tab-separated record. The values must be
     * supplied in the same order as the training-side VectorAssembler columns.
     * @param splits the tab-separated fields of one input line
     * @param inputFields the input fields declared by the PMML model
     * @return the prepared argument map
     */
    public static LinkedHashMap<FieldName, FieldValue> readArgumentsFromLine(String[] splits, List<InputField> inputFields) {
        List<Double> lists = new ArrayList<Double>();
        lists.add(Double.valueOf(splits[3]));
        lists.add(Double.valueOf(splits[4]));
        lists.add(Double.valueOf(splits[5]));
        lists.add(Double.valueOf(splits[7]));
        lists.add(Double.valueOf(splits[8]));
        lists.add(Double.valueOf(splits[9]));
        lists.add(Double.valueOf(splits[10]));
        lists.add(Double.valueOf(splits[11]));
        LinkedHashMap<FieldName, FieldValue> arguments = new LinkedHashMap<FieldName, FieldValue>();
        int i = 0;
        for(InputField inputField : inputFields){
            FieldName inputFieldName = inputField.getName();
            // The raw (ie. user-supplied) value could be any Java primitive value
            Object rawValue = lists.get(i);
            // The raw value is passed through: 1) outlier treatment, 2) missing value treatment, 3) invalid value treatment and 4) type conversion
            FieldValue inputFieldValue = inputField.prepare(rawValue);
            arguments.put(inputFieldName, inputFieldValue);
            i+=1;
        }
        return arguments;
    }
    public static Double transToDouble(String label) {
        try {
            return Double.valueOf(label);
        }catch (Exception e){
            return Double.valueOf(0);
        }
    }
}
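The same evaluator can also score a single in-memory record, which is closer to how an online service would be called. A minimal Scala sketch against the same model path; the feature values below are made up, and the field names are the original training columns:

import java.io.FileInputStream
import scala.collection.JavaConverters._
import org.dmg.pmml.FieldName
import org.jpmml.evaluator.{FieldValue, ModelEvaluatorFactory}
import org.jpmml.model.PMMLUtil

object SingleRecordScore {
  def main(args: Array[String]): Unit = {
    val pmml = PMMLUtil.unmarshal(new FileInputStream("data/pmmlmodel/rf.pmml"))
    val evaluator = ModelEvaluatorFactory.newInstance().newModelEvaluator(pmml)
    evaluator.verify()
    // Hypothetical raw feature values, keyed by the original column names.
    val raw = Map[String, java.lang.Double](
      "degree" -> 3.0, "tcNum" -> 1.0, "pageRank" -> 0.15, "commVertexNum" -> 2.0,
      "normQ" -> 0.5, "gtRate" -> 0.1, "eqRate" -> 0.2, "ltRate" -> 0.7)
    val arguments = new java.util.LinkedHashMap[FieldName, FieldValue]()
    for (field <- evaluator.getInputFields.asScala) {
      arguments.put(field.getName, field.prepare(raw(field.getName.getValue)))
    }
    // The result map contains the predicted label plus class probabilities.
    println(evaluator.evaluate(arguments))
  }
}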