Data Mining: Comparing Python, Spark, and Flink Machine Learning Development Tools

阿新 • Published: 2020-10-21

Different Tools

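Among the common machine learning tools, R and Python are the usual choices for general data mining and statistical analysis, while Flink and Spark are used when the data volume is large.
Knowing these tools well makes it possible to explore data and implement ideas on it.
This article starts from the data mining and machine learning workflow in Python and uses it to compare the machine learning packages of Spark and Flink, so that familiarity with one of them carries over to the others.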

Python

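 General workflow: get data -> preprocess data -> train the model -> evaluate the model -> predict / classify
scikit-learn: built on NumPy, SciPy, and matplotlib
  The pipeline mechanism wraps and manages all the steps as one streamlined workflow (streaming workflows with pipelines)
      It chains several algorithm models together, for example combining feature extraction, normalization, and classification into a typical machine learning workflow. This is an innovation in programming technique, not in algorithms.
      Transformer, Estimator, Pipeline
  In detail
     01. Transformer (StandardScaler, MinMaxScaler)
     02. Estimator (LinearRegression, LogisticRegression, Lasso, Ridge)
         Every machine learning algorithm model is called an estimator
     03. Pipeline: combines Transformers and an Estimator into one larger model
     Pipeline
         Use Pipeline to combine data preprocessing and the model into a new model
         Call fit and predict directly to train and predict with all the steps in the pipeline
         Can be combined with grid search to select parameters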
 Example
     e.g. from sklearn.pipeline import Pipeline
     Steps:
      Data normalization              from sklearn import preprocessing
      Feature selection               from sklearn.ensemble import ExtraTreesClassifier
      Choosing an algorithm           from sklearn.linear_model import LogisticRegression
      Tuning algorithm parameters     from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
     One-hot encoding
     Splitting the dataset
     Model:
      # Fit the model
      model.fit(X_train, y_train)
      # Predict with the model
      model.predict(X_test)
      # Get the model's parameters
      model.get_params()
     Saving and loading the model
      import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
      # Save the model
      joblib.dump(model, 'model.pickle')
      # Load the model
      model = joblib.load('model.pickle')
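
The steps listed above can be sketched end to end as follows. This is a minimal sketch, assuming a recent scikit-learn with joblib installed. The synthetic dataset from make_classification, the step names ("scale", "select", "clf"), the parameter grid, and the temporary file location are illustrative assumptions, not part of the original notes.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Split the dataset into training and test parts.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Transformers come first; the final step is the Estimator that predicts.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectFromModel(ExtraTreesClassifier(n_estimators=50, random_state=0))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid search addresses a step's parameter as <step name>__<parameter name>.
search = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X_train, y_train)

# The best pipeline, refit on the whole training set.
model = search.best_estimator_
predictions = model.predict(X_test)

# Save the fitted pipeline and load it back.
model_path = os.path.join(tempfile.mkdtemp(), "model.pickle")
joblib.dump(model, model_path)
restored = joblib.load(model_path)
```

Because scaling and feature selection live inside the pipeline, grid search refits them on each training fold, so no information from the validation fold leaks into preprocessing.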

Spark

1. Basic Concepts

org.apache.spark.ml
PipelineStage
 A stage in a pipeline, either an [[Estimator]] or a [[Transformer]].
Transformer
 Transforms one dataset into another.
Estimator
 Estimators fit models to data.
Model
 A fitted model, i.e., a [[Transformer]] produced by an [[Estimator]].
Pipeline
 A Pipeline consists of a sequence of stages, each of which is either an [[Estimator]] or a [[Transformer]].
PipelineModel
 object PipelineModel extends MLReadable[PipelineModel]
Parameter
 Used to set the parameters of a Transformer or an Estimator.
VectorAssembler
 A feature transformer that merges multiple columns into a vector column.
CrossValidator, CrossValidatorModel
 CrossValidatorParams: Params for [[CrossValidator]] and [[CrossValidatorModel]].
 Spark provides model selectors in the org.apache.spark.ml.tuning package; they substitute different parameter values and compare the resulting models.
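
The relationship between these concepts can be sketched in a few lines of plain Python. This is a simplified illustration of the contract, not Spark's implementation: the class names mirror the concepts above, while AddOne and MeanCenter are hypothetical stages invented for the example, and a dataset is reduced to a list of floats.

```python
class Transformer:
    """A stage that maps one dataset to another."""

    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    """A stage that is fit on data and returns a Model, which is itself a Transformer."""

    def fit(self, rows):
        raise NotImplementedError


class AddOne(Transformer):
    """Hypothetical stateless transformer."""

    def transform(self, rows):
        return [x + 1.0 for x in rows]


class MeanCenterModel(Transformer):
    """The fitted model produced by MeanCenter."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [x - self.mean for x in rows]


class MeanCenter(Estimator):
    """Hypothetical estimator that learns the mean of the data."""

    def fit(self, rows):
        return MeanCenterModel(sum(rows) / len(rows))


class PipelineModel(Transformer):
    """A fitted pipeline: every stage is now a Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """An ordered list of stages, each an Estimator or a Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)  # an Estimator becomes its fitted Model
            fitted.append(stage)
            rows = stage.transform(rows)  # the next stage sees the transformed data
        return PipelineModel(fitted)


model = Pipeline([AddOne(), MeanCenter()]).fit([1.0, 2.0, 3.0])
centered = model.transform([1.0, 2.0, 3.0])
```

The point of the design is that fitting a Pipeline yields a PipelineModel made only of Transformers, so the same sequence of steps is replayed unchanged on new data.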

2. Spark's Dataset

randomSplit
 Randomly splits this Dataset with the provided weights.

randomSplitAsList
 Returns a Java list that contains randomly split Dataset with the provided weights.
Input:   weights: Array[Double]
         weights: List[Double]
Returns: Array[Dataset] or List
Examples:
 Downsampling the positive or negative samples (when there are too many samples)
     double[] weights = {pos_rate, 1.0 - pos_rate};
     Dataset<Row>[] arr = posSet.randomSplit(weights);
     posSet = arr[0];
 Balancing positive and negative samples
     // Merge the positive and negative sample data
     Dataset<Row> dataUse = dataPos_sample.union(dataNeg_sample);
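
The idea behind a weighted random split followed by a union can be sketched in plain Python. This is an illustration under stated assumptions, not Spark's code: random_split is a hypothetical helper, rows are plain tuples, and list concatenation stands in for union. As with Spark's randomSplit, the weights are normalized and the sizes of the resulting parts are only approximately proportional to them.

```python
import random


def random_split(rows, weights, seed=42):
    """Split rows into len(weights) parts; weights are normalized to sum to 1."""
    total = sum(weights)
    bounds = [0.0]
    for weight in weights:
        bounds.append(bounds[-1] + weight / total)

    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for i in range(len(weights)):
            if bounds[i] <= draw < bounds[i + 1]:
                parts[i].append(row)
                break
        else:
            # Guards against floating-point rounding in the last bound.
            parts[-1].append(row)
    return parts


# Many positives, few negatives (made-up data for illustration).
positives = [("pos", i) for i in range(1000)]
negatives = [("neg", i) for i in range(100)]

# Keep roughly pos_rate of the positives, then merge with the negatives.
pos_rate = 0.1
pos_sample, pos_rest = random_split(positives, [pos_rate, 1.0 - pos_rate])
data_use = pos_sample + negatives
```

Each row is assigned by an independent random draw, so the first part holds about 10% of the positives rather than exactly 100 rows.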
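Define the PipelineStages of the Pipeline, such as feature extraction, transformation, and model training.
 With Transformers and Estimators that each handle a specific problem,
 we can arrange the PipelineStages in order according to the processing logic and build a Pipeline.
 Each stage is either a Transformer or an Estimator.
 The stages run in order, and the input DataFrame is transformed as it passes through each stage.
 Pipeline pipeline = new Pipeline().setStages(new PipelineStage[]{stage1, stage2, stage3, …});
 The training dataset is then passed to the fit method of the Pipeline instance, which processes the source training data stage by stage.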

// Build the pipeline from the array of stages
    Pipeline pipeline = new Pipeline().setStages(pipeArr);
    PipelineModel model = pipeline.fit(train_data);

    Load a saved model: PipelineModel model2 = PipelineModel.load(path);
 Getting the best model parameters with CrossValidator (model selection through cross-validation)
  // An evaluator (setEvaluator) and a parameter grid (setEstimatorParamMaps) must also be set before calling fit
  CrossValidator rf_cv = new CrossValidator().setEstimator(pipeline);
  CrossValidatorModel rf_model = rf_cv.fit(train_data);
    Load a saved model: CrossValidatorModel rf_model2 = CrossValidatorModel.load(path);

 e.g. // Chain indexers and tree in a Pipeline.
 Pipeline pipeline = new Pipeline()
  .setStages(new PipelineStage[]{labelIndexer, featureIndexer, dt, labelConverter});
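
What model selection by cross-validation does can be sketched in plain Python. This is a toy illustration, not Spark's CrossValidator: the helper names, the threshold "estimator" (which ignores its training rows and only applies the candidate parameter), and the data are all assumptions made for the example.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, valid_idx) pairs; fold f validates on indices f, f+k, f+2k, ..."""
    for fold in range(k):
        valid = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, valid


def fit_threshold_model(rows, threshold):
    """Toy estimator: returns a classifier that predicts 1 when x >= threshold.

    The training rows are unused here; a real estimator would learn from them.
    """
    return lambda x: 1 if x >= threshold else 0


def accuracy(model, rows):
    """Fraction of (x, label) rows that the model classifies correctly."""
    return sum(1 for x, label in rows if model(x) == label) / len(rows)


def cross_validate(rows, param_grid, k=3):
    """Score every candidate parameter by its mean accuracy over k folds."""
    best_param, best_score = None, -1.0
    for param in param_grid:
        scores = []
        for train_idx, valid_idx in k_fold_indices(len(rows), k):
            model = fit_threshold_model([rows[i] for i in train_idx], param)
            scores.append(accuracy(model, [rows[i] for i in valid_idx]))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_param, best_score = param, mean_score
    # Refit on all rows with the winning parameter to get the final model.
    return best_param, best_score, fit_threshold_model(rows, best_param)


# Ten points labelled 1 when x >= 0.5 (made-up data).
data = [(x / 10.0, 1 if x >= 5 else 0) for x in range(10)]
best_param, best_score, best_model = cross_validate(data, [0.2, 0.5, 0.8])
```

In this sketch the refit model plays the part that the best model held by a CrossValidatorModel plays in Spark: the candidate with the best average validation metric, trained once more on the full training set.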

Flink

1. Flink ML

PipelineStage
 Base class for a stage in a pipeline; it does not have any actual functionality.
 Its subclasses must be either Estimator or Transformer.
Transformer
 A transformer is a {@link PipelineStage} that transforms an input {@link Table} to a result {@link Table}.
Estimator
 Estimators are {@link PipelineStage}s responsible for training and generating machine learning models.
Model
 A model is an ordinary {@link Transformer} except how it is created.
Pipeline
 A pipeline is a linear workflow which chains {@link Estimator}s and {@link Transformer}s to execute an algorithm.
 It can also be used as a {@link PipelineStage} in another pipeline.

Parameter-related classes: Params, WithParams, ParamInfoFactory, ParamInfo

2. Alink

com.alibaba.alink.pipeline
Pipeline
 A pipeline is a linear workflow which chains {@link EstimatorBase}s and {@link TransformerBase}s to execute an algorithm.
 public class Pipeline extends EstimatorBase<Pipeline, PipelineModel>
PipelineModel
 public class PipelineModel extends ModelBase<PipelineModel> implements LocalPredictable
PipelineStageBase
 The base class for a stage in a pipeline, either an [[EstimatorBase]] or a [[TransformerBase]].
EstimatorBase
 public abstract class EstimatorBase<E extends EstimatorBase<E, M>, M extends ModelBase<M>> extends PipelineStageBase<E> implements Estimator<E, M>
TransformerBase
 public abstract class TransformerBase<T extends TransformerBase<T>> extends PipelineStageBase<T> implements Transformer<T>
VectorAssembler
 VectorAssembler is a transformer that combines a given list of columns into a single vector column.
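
What a vector assembler does to a table can be sketched in plain Python. This is an illustration of the idea only, not the Alink or Spark API: assemble_vector is a hypothetical helper, rows are dictionaries, and the output column name "features" is an assumption.

```python
def assemble_vector(rows, input_cols, output_col="features"):
    """Return new rows with the input columns combined into one list-valued column."""
    assembled = []
    for row in rows:
        new_row = dict(row)  # copy, so the input rows stay unchanged
        new_row[output_col] = [float(row[col]) for col in input_cols]
        assembled.append(new_row)
    return assembled


# Made-up rows for illustration.
rows = [
    {"age": 30, "income": 5.5, "label": 1},
    {"age": 41, "income": 7.0, "label": 0},
]
result = assemble_vector(rows, ["age", "income"])
```

Downstream estimators can then read a single feature column instead of a varying list of input columns.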

References

Source code
