Spark2.0 Pipelines

阿新 • Published: 2018-01-19


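MLlib makes it easier to combine its many machine learning algorithm APIs in a single pipeline, or workflow. The pipeline concept is mostly inspired by the scikit-learn library.
The ML API uses the DataFrame from Spark SQL as its machine learning dataset. Different columns of a DataFrame can store text, feature vectors, true labels, and predictions.

Transformer: A Transformer is an algorithm that transforms one DataFrame into another DataFrame. For example, it can turn a DataFrame with features into a DataFrame with predictions.
Estimator: An Estimator is an algorithm that is fit on a DataFrame to produce a Transformer. For example, a learning algorithm is an Estimator: it is trained on a training DataFrame and produces a model.

Pipeline: A Pipeline chains multiple Transformers and Estimators together to carry out a specific machine learning workflow.
Parameter: All Transformers and Estimators share a common API for specifying their parameters.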

MLlib standardizes APIs for machine learning algorithms to make it
easier to combine multiple algorithms into a single pipeline, or
workflow. This section covers the key concepts introduced by the
Pipelines API, where the pipeline concept is mostly inspired by the
scikit-learn project.

DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.

Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data. This API adopts the DataFrame from Spark SQL in order to support a variety of data types.

DataFrame supports many basic and structured types; see the Spark SQL datatype reference for a list of supported types. In addition to the types listed in the Spark SQL guide, DataFrame can use ML Vector types.

A DataFrame can be created either implicitly or explicitly from a
regular RDD. See the code examples below and the Spark SQL programming guide for examples.

Columns in a DataFrame are named. The code examples below use names such as “text,” “features,” and “label.”

Transformer: A Transformer is an algorithm which can transform one
DataFrame into another DataFrame. E.g., an ML model is a Transformer
which transforms a DataFrame with features into a DataFrame with
predictions.

Estimator: An Estimator is an algorithm which can be fit on a
DataFrame to produce a Transformer. E.g., a learning algorithm is an
Estimator which trains on a DataFrame and produces a model.
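The relationship between the two roles can be sketched in plain Python. This is a toy model of the fit/transform contract described above, not Spark code: a "DataFrame" is assumed to be a list of dict rows, and the names `ThresholdEstimator`, `ThresholdModel`, `feature`, `label`, and `prediction` are hypothetical and chosen only for illustration.

```python
from statistics import mean
from typing import Dict, List

Row = Dict[str, float]
Frame = List[Row]  # toy stand-in for a DataFrame: rows with named columns


class ThresholdModel:
    """A Transformer: maps a frame with 'feature' to a frame with 'prediction'."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def transform(self, frame: Frame) -> Frame:
        # Build new rows instead of mutating the input frame.
        return [
            {**row, "prediction": 1.0 if row["feature"] > self.threshold else 0.0}
            for row in frame
        ]


class ThresholdEstimator:
    """An Estimator: fit() learns from labeled rows and returns a Transformer."""

    def fit(self, frame: Frame) -> ThresholdModel:
        positives = [r["feature"] for r in frame if r["label"] == 1.0]
        negatives = [r["feature"] for r in frame if r["label"] == 0.0]
        # Midpoint between the two class means; assumes both classes are present.
        return ThresholdModel((mean(positives) + mean(negatives)) / 2)


training = [
    {"feature": 1.0, "label": 0.0},
    {"feature": 2.0, "label": 0.0},
    {"feature": 8.0, "label": 1.0},
    {"feature": 9.0, "label": 1.0},
]

# Fitting the Estimator yields a model, and the model is itself a Transformer.
model = ThresholdEstimator().fit(training)
predictions = model.transform([{"feature": 7.0}, {"feature": 3.0}])
print(model.threshold, predictions)
```

The point of the sketch is the split of responsibilities: only the Estimator sees labels, and the object it returns needs nothing but the feature column to add predictions.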

Pipeline: A Pipeline chains multiple Transformers and Estimators
together to specify an ML workflow.

Parameter: All Transformers and Estimators now share a common API for
specifying parameters.
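A minimal pure-Python sketch of how chaining and shared parameters might fit together is given here. It is an illustration under stated assumptions rather than Spark's implementation: frames are lists of dict rows, and `Stage`, `WordSplitter`, `KeywordEstimator`, `KeywordModel`, and the parameter names `input_col` and `output_col` are all hypothetical. The fitting loop follows the behavior described for pipelines: a Transformer stage is applied directly, while an Estimator stage is fit first and its resulting model is then applied.

```python
from typing import Any, Dict, List

Frame = List[Dict[str, Any]]  # toy stand-in for a DataFrame: rows with named columns


class Stage:
    """Hypothetical base class: one parameter API shared by every stage."""

    def __init__(self, **params: Any) -> None:
        self.params: Dict[str, Any] = dict(params)

    def set(self, **params: Any) -> "Stage":
        self.params.update(params)
        return self


class WordSplitter(Stage):
    """Transformer: writes the lowercase words of input_col into output_col."""

    def transform(self, frame: Frame) -> Frame:
        src, dst = self.params["input_col"], self.params["output_col"]
        return [{**row, dst: str(row[src]).lower().split()} for row in frame]


class KeywordModel(Stage):
    """Transformer produced by KeywordEstimator: flags rows containing a keyword."""

    def transform(self, frame: Frame) -> Frame:
        col, keywords = self.params["input_col"], self.params["keywords"]
        return [
            {**row, "prediction": 1.0 if keywords & set(row[col]) else 0.0}
            for row in frame
        ]


class KeywordEstimator(Stage):
    """Estimator: keeps words seen in positive rows but never in negative rows."""

    def fit(self, frame: Frame) -> KeywordModel:
        col = self.params["input_col"]
        pos = {w for row in frame if row["label"] == 1.0 for w in row[col]}
        neg = {w for row in frame if row["label"] == 0.0 for w in row[col]}
        return KeywordModel(input_col=col, keywords=pos - neg)


class PipelineModel:
    """Fitted pipeline: applies each fitted Transformer in order."""

    def __init__(self, transformers: List[Any]) -> None:
        self.transformers = transformers

    def transform(self, frame: Frame) -> Frame:
        for transformer in self.transformers:
            frame = transformer.transform(frame)
        return frame


class Pipeline:
    """Chains stages; fit() turns every Estimator stage into its fitted model."""

    def __init__(self, stages: List[Any]) -> None:
        self.stages = stages

    def fit(self, frame: Frame) -> PipelineModel:
        fitted = []
        for stage in self.stages:
            # Estimators are fit on the current frame; Transformers are used as-is.
            transformer = stage.fit(frame) if hasattr(stage, "fit") else stage
            fitted.append(transformer)
            frame = transformer.transform(frame)
        return PipelineModel(fitted)


training = [
    {"text": "spark is great", "label": 1.0},
    {"text": "hadoop mapreduce", "label": 0.0},
]

# Parameters can be given at construction time or later through set().
splitter = WordSplitter(input_col="text").set(output_col="words")
pipeline = Pipeline([splitter, KeywordEstimator(input_col="words")])

model = pipeline.fit(training)
result = model.transform([{"text": "I love Spark"}, {"text": "plain mapreduce job"}])
predictions = [row["prediction"] for row in result]
print(predictions)
```

Because every stage takes its settings through the same constructor and `set()` call, the pipeline itself never needs to know which parameters a particular stage understands.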
