Machine Learning with PySpark (Classification)

阿新 • Published: 2018-12-30

LogisticRegression

class pyspark.ml.classification.LogisticRegression(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, regParam=0.0, elasticNetParam=0.0, tol=1e-6, fitIntercept=True, threshold=0.5, thresholds=None, probabilityCol="probability", rawPredictionCol="rawPrediction", standardization=True, weightCol=None, aggregationDepth=2, family="auto")

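Parameters

regParam: the regularization parameter (>= 0)
elasticNetParam: the ElasticNet mixing parameter alpha, in the range [0, 1]. With alpha = 0 the penalty is L2 regularization; with alpha = 1 it is L1 regularization
fitIntercept: whether to fit an intercept term
standardization: whether to standardize the training features before fitting the model
aggregationDepth: the suggested depth for treeAggregate (>= 2)
family: the name of the family that describes the label distribution; options: "auto", "binomial", "multinomial"
Getter and setter methods for each parameter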
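
To make the interplay between regParam and elasticNetParam concrete, here is a minimal sketch in plain Python. It assumes the penalty has the form regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2); the helper name elastic_net_penalty and the sample weights are made up for illustration and are not part of pyspark.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Penalty added to the loss for a coefficient vector (illustrative helper)."""
    alpha = elastic_net_param
    l1 = sum(abs(w) for w in weights)        # L1 norm of the coefficients
    l2_squared = sum(w * w for w in weights)  # squared L2 norm
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


weights = [0.5, -2.0, 0.0]  # hypothetical coefficients
ridge = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.0)  # pure L2
lasso = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=1.0)  # pure L1
mixed = elastic_net_penalty(weights, reg_param=0.1, elastic_net_param=0.5)  # half and half
print(ridge, lasso, mixed)
```

With regParam=0.0 (the default) the penalty vanishes, so elasticNetParam has no effect until regParam is raised above zero.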

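Methods and attributes of the fitted model

coefficientMatrix: the model's coefficient matrix (used for multinomial models)
coefficients: the coefficients of a binomial logistic regression model; raises an exception for a multinomial model
evaluate(dataset): evaluates the model on a test dataset
hasSummary: whether a training summary exists
intercept: the intercept of a binomial logistic regression model
interceptVector: the intercept vector of a multinomial logistic regression model
summary: returns the training summary
transform(dataset, params=None)
Attributes of the summary
predictions: the DataFrame of predictions output by the model's transform method
probabilityCol: the name of the field in predictions that holds the probability of each class
Additional attributes of a binomial summary:
areaUnderROC: the area under the ROC curve (AUC)
fMeasureByThreshold: a DataFrame with two fields (threshold, F-measure), computed with beta = 1.0
pr: a DataFrame with two fields (recall, precision) describing the precision-recall curve
precisionByThreshold: a DataFrame with two fields (threshold, precision); every probability that occurs in the transformed data is used as a threshold when computing precision
recallByThreshold: a DataFrame with two fields (threshold, recall); every probability that occurs in the transformed data is used as a threshold when computing recall
roc: a DataFrame with two fields (FPR, TPR)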
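
The per-threshold attributes are easier to read with a small worked example. The sketch below recomputes precision, recall and the F-measure (beta = 1.0) in plain Python, using every distinct probability as a threshold. It assumes a row counts as positive when its probability is greater than or equal to the threshold; the function name and the sample data are hypothetical.

```python
def metrics_by_threshold(scored):
    """scored: list of (probability_of_positive, label) pairs with label 0 or 1.

    Returns rows of (threshold, precision, recall, f1), highest threshold first.
    """
    total_pos = sum(1 for _, label in scored if label == 1)
    rows = []
    for threshold in sorted({p for p, _ in scored}, reverse=True):
        tp = sum(1 for p, label in scored if p >= threshold and label == 1)
        fp = sum(1 for p, label in scored if p >= threshold and label == 0)
        precision = tp / (tp + fp)  # tp + fp >= 1 because the threshold is an observed score
        recall = tp / total_pos if total_pos else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((threshold, precision, recall, f1))
    return rows


# Hypothetical scored rows: (probability of the positive class, true label)
scored = [(0.9, 1), (0.8, 0), (0.6, 1), (0.3, 0)]
rows = metrics_by_threshold(scored)
for row in rows:
    print(row)
```

Lowering the threshold can only keep or raise recall, while precision may move in either direction, which is why the pr curve is reported as (recall, precision) pairs.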

Code

from pyspark.ml.linalg import Vectors
from pyspark.sql import Row
from pyspark.ml.classification import LogisticRegression

# Assumes an active SparkContext `sc` with a SparkSession, as in the pyspark shell.

# Binomial logistic regression
bdf = sc.parallelize([
    Row(label=1.0, weight=2.0, features=Vectors.dense(1.0)),
    Row(label=0.0, weight=2.0, features=Vectors.sparse(1, [], []))]).toDF()
bdf.show()
blor = LogisticRegression(maxIter=5, regParam=0.01, weightCol='weight')
blorModel = blor.fit(bdf)
blorModel.coefficients
blorModel.intercept

# Multinomial logistic regression
mdf = sc.parallelize([
    Row(label=1.0, weight=2.0, features=Vectors.dense(1.0)),
    Row(label=0.0, weight=2.0, features=Vectors.sparse(1, [], [])),
    Row(label=2.0, weight=2.0, features=Vectors.dense(3.0))]).toDF()
mlor = LogisticRegression(maxIter=5, regParam=0.01, weightCol='weight', family='multinomial')
mlorModel = mlor.fit(mdf)
print(mlorModel.coefficientMatrix)
mlorModel.interceptVector

# Prediction
test0 = sc.parallelize([Row(features=Vectors.dense(-1.0))]).toDF()
result = blorModel.transform(test0).head()
result.prediction
result.probability
result.rawPrediction

test1 = sc.parallelize([Row(features=Vectors.sparse(1, [0], [1.0]))]).toDF()
blorModel.transform(test1).head().prediction
blorModel.transform(test1).show()

# Model evaluation
blorModel.summary.roc.show()
blorModel.summary.pr.show()
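
The three fields read from the transformed row above (rawPrediction, probability, prediction) can be derived from the coefficients, the intercept and the threshold. Here is a minimal sketch for the binomial case, assuming rawPrediction holds [-margin, margin] and the positive class is predicted when its probability exceeds threshold. The coefficient and intercept values are hypothetical, not the ones Spark would fit on bdf.

```python
import math


def predict_binomial(features, coefficients, intercept, threshold=0.5):
    """Illustrative re-derivation of the binomial transform outputs."""
    margin = sum(w * x for w, x in zip(coefficients, features)) + intercept
    p1 = 1.0 / (1.0 + math.exp(-margin))  # sigmoid of the margin
    raw_prediction = [-margin, margin]
    probability = [1.0 - p1, p1]
    prediction = 1.0 if p1 > threshold else 0.0
    return raw_prediction, probability, prediction


# Hypothetical model: one coefficient and an intercept chosen for illustration.
raw0, prob0, pred0 = predict_binomial([-1.0], coefficients=[2.0], intercept=-1.0)
raw1, prob1, pred1 = predict_binomial([1.0], coefficients=[2.0], intercept=-1.0)
print(raw0, prob0, pred0)
print(raw1, prob1, pred1)
```

Raising threshold above 0.5 makes the model more reluctant to predict the positive class without changing the probabilities themselves.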

DecisionTreeClassifier

class pyspark.ml.classification.DecisionTreeClassifier(self, featuresCol="features", labelCol="label", predictionCol="prediction", probabilityCol="probability", rawPredictionCol="rawPrediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="gini", seed=None)

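Parameters

checkpointInterval: the checkpoint interval (>= 1), or -1 to disable checkpointing. For example, 10 means the cache is checkpointed every 10 iterations
fit(dataset, params=None)
impurity: the criterion used to compute information gain; options: "entropy", "gini"
maxBins: the maximum number of bins used to discretize continuous features; must be >= 2 and >= the number of categories in any categorical feature
maxDepth: the maximum depth of the tree
minInfoGain: the minimum information gain required to split a node
minInstancesPerNode: the minimum number of instances each node must contain
Getter and setter methods for each parameter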
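
The impurity, minInfoGain and minInstancesPerNode parameters all act on the same quantity: the information gain of a candidate split. The sketch below computes it in plain Python from per-class instance counts. The function names are illustrative, the entropy uses base-2 logarithms, and the acceptance rule (gain >= minInfoGain and both children large enough) is an assumption about how these parameters combine, not Spark's implementation.

```python
import math


def gini(counts):
    """Gini impurity of a node given its per-class instance counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Entropy (base 2) of a node given its per-class instance counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def info_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = sum(parent)
    return (impurity(parent)
            - sum(left) / n * impurity(left)
            - sum(right) / n * impurity(right))


def split_allowed(parent, left, right, min_info_gain=0.0,
                  min_instances_per_node=1, impurity=gini):
    """Assumed acceptance rule for a candidate split."""
    if min(sum(left), sum(right)) < min_instances_per_node:
        return False
    return info_gain(parent, left, right, impurity) >= min_info_gain


parent, left, right = [5, 5], [4, 1], [1, 4]  # hypothetical class counts
gain_gini = info_gain(parent, left, right)
gain_entropy = info_gain(parent, left, right, impurity=entropy)
print(gain_gini, gain_entropy)
print(split_allowed(parent, left, right, min_info_gain=0.2))
```

The same split can pass under one impurity and fail under the other, so minInfoGain should be tuned together with impurity.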

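Methods and attributes of the fitted model

depth: the depth of the decision tree
featureImportances: an estimate of each feature's importance, computed as follows:
    • importance of feature j = the sum of the information gain over all nodes that split on feature j, where each gain is scaled by the number of instances passing through that node
    • the importances are normalized so that they sum to 1 over the whole tree
transform(dataset, params=None)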
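
The two rules for featureImportances can be sketched in a few lines of plain Python. The representation of a tree as a list of (feature index, gain, instance count) tuples is invented for this example and is not how pyspark exposes its trees.

```python
def feature_importances(split_nodes, num_features):
    """split_nodes: (feature_index, gain, instance_count) for each internal node."""
    totals = [0.0] * num_features
    for feature, gain, count in split_nodes:
        totals[feature] += gain * count  # gain scaled by the instances reaching the node
    norm = sum(totals)
    # Normalize so the importances sum to 1 (all zeros if the tree has no splits).
    return [t / norm if norm > 0 else 0.0 for t in totals]


# Hypothetical tree: the root splits on feature 0, its children on features 1 and 0.
nodes = [(0, 0.18, 10), (1, 0.32, 5), (0, 0.10, 5)]
importances = feature_importances(nodes, num_features=3)
print(importances)
```

Feature 2 is never used for a split, so its importance is 0; a tree with a single split assigns importance 1 to the one feature it uses.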

Code

import tempfile
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import DecisionTreeClassifier, DecisionTreeClassificationModel

# Assumes an active SparkSession named `spark`, as in the pyspark shell.
df = spark.createDataFrame([(1.0, Vectors.dense(1.0)), (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(df)
td = si_model.transform(df)
dt = DecisionTreeClassifier(maxDepth=2, labelCol="indexed")
model = dt.fit(td)
model.numNodes
#3
model.depth
#1
model.featureImportances
#SparseVector(1, {0: 1.0})
model.numFeatures
#1
model.numClasses
#2
print(model.toDebugString)
#DecisionTreeClassificationModel (uid=...) of depth 1 with 3 nodes...
test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
result = model.transform(test0).head()
result.prediction
#0.0
result.probability
#DenseVector([1.0, 0.0])
result.rawPrediction
#DenseVector([1.0, 0.0])
test1 = spark.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
model.transform(test1).head().prediction
#1.0

# Save and reload the estimator and the fitted model.
# temp_path can be any writable directory; a fresh temporary one is assumed here.
temp_path = tempfile.mkdtemp()
dtc_path = temp_path + "/dtc"
dt.save(dtc_path)
dt2 = DecisionTreeClassifier.load(dtc_path)
dt2.getMaxDepth()
#2
model_path = temp_path + "/dtc_model"
model.save(model_path)
model2 = DecisionTreeClassificationModel.load(model_path)
model.featureImportances == model2.featureImportances
#True

RandomForestClassifier

class pyspark.ml.classification.RandomForestClassifier(self, featuresCol="features", labelCol="label", predictionCol="prediction", probabilityCol="probability", rawPredictionCol="rawPrediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, impurity="gini", numTrees=20, featureSubsetStrategy="auto", seed=None, subsamplingRate=1.0)

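Parameters

checkpointInterval: same as DecisionTreeClassifier
featureSubsetStrategy: the number of features to consider for splits at each tree node; options: "auto", "all", "onethird", "sqrt", "log2", a fraction in (0.0, 1.0], or an integer in [1, n]
fit(dataset, params=None)
impurity: same as DecisionTreeClassifier
maxBins: same as DecisionTreeClassifier
maxDepth: same as DecisionTreeClassifier
minInfoGain: same as DecisionTreeClassifier
numTrees: the number of trees to train
subsamplingRate: the fraction of the training data used to train each decision tree, in the range (0, 1]
Getter and setter methods for each parameter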
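
The featureSubsetStrategy options are easiest to compare by turning each one into a concrete feature count. The sketch below does that in plain Python. It assumes that "auto" means "all" for a single tree and "sqrt" for a classification forest, and that fractional results are rounded up; both the helper name features_per_node and the rounding are assumptions for illustration, so check them against the Spark version in use.

```python
import math


def features_per_node(strategy, num_features, num_trees):
    """Number of candidate features per split for a classifier (illustrative)."""
    if strategy == "auto":
        # Assumed rule: a single tree uses every feature, a forest uses sqrt.
        strategy = "all" if num_trees == 1 else "sqrt"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))
    if strategy == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    if strategy == "onethird":
        return math.ceil(num_features / 3.0)
    try:
        # "[1-n]": an explicit feature count, capped at the number of features.
        return min(int(strategy), num_features)
    except ValueError:
        # "(0.0-1.0]": a fraction of the features.
        return math.ceil(float(strategy) * num_features)


for s in ["auto", "all", "onethird", "sqrt", "log2", "0.5", "8"]:
    print(s, features_per_node(s, num_features=100, num_trees=20))
```

Note that "1" and "1.0" mean different things under this reading: the first asks for exactly one feature, the second for all of them.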

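Methods and attributes of the fitted model

featureImportances: same as DecisionTreeClassifier; for details see Hastie et al., "The Elements of Statistical Learning"
getNumTrees: the number of trees
transform(dataset, params=None)
treeWeights: the weight of each tree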
Code

import tempfile
import numpy
from numpy import allclose
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import RandomForestClassifier, RandomForestClassificationModel

# Assumes an active SparkSession named `spark`, as in the pyspark shell.
df = spark.createDataFrame([
    (1.0, Vectors.dense(1.0)),
    (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(df)
td = si_model.transform(df)
rf = RandomForestClassifier(numTrees=3, maxDepth=2, labelCol="indexed", seed=42)
model = rf.fit(td)
model.featureImportances
#SparseVector(1, {0: 1.0})
allclose(model.treeWeights, [1.0, 1.0, 1.0])
#True
test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
result = model.transform(test0).head()
result.prediction
#0.0
numpy.argmax(result.probability)
#0
numpy.argmax(result.rawPrediction)
#0
test1 = spark.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
model.transform(test1).head().prediction
#1.0
model.trees
#[DecisionTreeClassificationModel (uid=...) of depth..., DecisionTreeClassificationModel...]

# Save and reload the estimator and the fitted model.
# temp_path can be any writable directory; a fresh temporary one is assumed here.
temp_path = tempfile.mkdtemp()
rfc_path = temp_path + "/rfc"
rf.save(rfc_path)
rf2 = RandomForestClassifier.load(rfc_path)
rf2.getNumTrees()
#3
model_path = temp_path + "/rfc_model"
model.save(model_path)
model2 = RandomForestClassificationModel.load(model_path)
model.featureImportances == model2.featureImportances
#True

GBTClassifier

class pyspark.ml.classification.GBTClassifier(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxDepth=5, maxBins=32, minInstancesPerNode=1, minInfoGain=0.0, maxMemoryInMB=256, cacheNodeIds=False, checkpointInterval=10, lossType="logistic", maxIter=20, stepSize=0.1, seed=None, subsamplingRate=1.0)

This classifier is implemented following:
J.H. Friedman. "Stochastic Gradient Boosting." 1999.

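Parameters

checkpointInterval: same as DecisionTreeClassifier
fit(dataset, params=None)
lossType: the loss function that GBT minimizes; option: "logistic"
maxBins: same as DecisionTreeClassifier
maxDepth: same as DecisionTreeClassifier
maxIter: the maximum number of boosting iterations (>= 0); one tree is trained per iteration
minInfoGain: same as DecisionTreeClassifier
minInstancesPerNode: same as DecisionTreeClassifier
stepSize: the step size (learning rate) applied at each iteration
subsamplingRate: same as RandomForestClassifier
Getter and setter methods for each parameter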

Methods and attributes of the fitted model

featureImportances: same as DecisionTreeClassifier
getNumTrees: same as RandomForestClassifier
totalNumNodes: the total number of nodes, summed over all trees
transform(dataset, params=None)
treeWeights: same as RandomForestClassifier

Code

import tempfile
from numpy import allclose
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import StringIndexer
from pyspark.ml.classification import GBTClassifier, GBTClassificationModel

# Assumes an active SparkSession named `spark`, as in the pyspark shell.
df = spark.createDataFrame([(1.0, Vectors.dense(1.0)), (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
stringIndexer = StringIndexer(inputCol="label", outputCol="indexed")
si_model = stringIndexer.fit(df)
td = si_model.transform(df)
gbt = GBTClassifier(maxIter=5, maxDepth=2, labelCol="indexed", seed=42)
model = gbt.fit(td)
model.featureImportances
#SparseVector(1, {0: 1.0})
allclose(model.treeWeights, [1.0, 0.1, 0.1, 0.1, 0.1])
#True
test0 = spark.createDataFrame([(Vectors.dense(-1.0),)], ["features"])
model.transform(test0).head().prediction
#0.0
test1 = spark.createDataFrame([(Vectors.sparse(1, [0], [1.0]),)], ["features"])
model.transform(test1).head().prediction
#1.0
model.totalNumNodes
#15
print(model.toDebugString)
#GBTClassificationModel (uid=...)...with 5 trees...

# Save and reload the estimator and the fitted model.
# temp_path can be any writable directory; a fresh temporary one is assumed here.
temp_path = tempfile.mkdtemp()
gbtc_path = temp_path + "/gbtc"
gbt.save(gbtc_path)
gbt2 = GBTClassifier.load(gbtc_path)
gbt2.getMaxDepth()
#2
model_path = temp_path + "/gbtc_model"
model.save(model_path)
model2 = GBTClassificationModel.load(model_path)
model.featureImportances == model2.featureImportances
#True
model.treeWeights == model2.treeWeights
#True
model.trees
#[DecisionTreeRegressionModel (uid=...) of depth..., DecisionTreeRegressionModel...]
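
The example above shows two things worth connecting: the trees are regression trees, and treeWeights is [1.0, 0.1, 0.1, 0.1, 0.1] for the default stepSize of 0.1. A minimal sketch of how such an ensemble could turn real-valued tree outputs into a class label follows. It assumes the label is decided by the sign of the weighted sum; the per-tree outputs are hypothetical and the function names are not part of pyspark.

```python
def gbt_margin(tree_outputs, tree_weights):
    """Weighted sum of the real-valued outputs of the regression trees."""
    return sum(w * out for w, out in zip(tree_weights, tree_outputs))


def gbt_predict(tree_outputs, tree_weights):
    """Assumed decision rule: positive margin -> class 1.0, otherwise 0.0."""
    return 1.0 if gbt_margin(tree_outputs, tree_weights) > 0.0 else 0.0


tree_weights = [1.0, 0.1, 0.1, 0.1, 0.1]        # as reported by model.treeWeights above
outputs_neg = [-1.0, -0.4, -0.3, -0.2, -0.1]    # hypothetical tree outputs for one row
outputs_pos = [1.0, 0.4, 0.3, 0.2, 0.1]         # hypothetical tree outputs for another row
print(gbt_margin(outputs_neg, tree_weights), gbt_predict(outputs_neg, tree_weights))
print(gbt_margin(outputs_pos, tree_weights), gbt_predict(outputs_pos, tree_weights))
```

Because later trees carry a weight of only 0.1, each of them nudges the margin rather than overturning it, which is the purpose of a small stepSize.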

NaiveBayes

class pyspark.ml.classification.NaiveBayes(self, featuresCol="features", labelCol="label", predictionCol="prediction", probabilityCol="probability", rawPredictionCol="rawPrediction", smoothing=1.0, modelType="multinomial", thresholds=None, weightCol=None)

Parameters

fit(dataset, params=None)
modelType: the model type; options: "multinomial" and "bernoulli"
smoothing: the smoothing parameter; must be >= 0, default 1.0
Getter and setter methods for each parameter

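Methods and attributes of the fitted model

numClasses: the number of classes
numFeatures: the number of features the model was trained on; returns -1 if unknown
pi: the log of the class priors
theta: the log of the class-conditional probabilities
transform(dataset, params=None)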

Code

import tempfile
from pyspark.sql import Row
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import NaiveBayes, NaiveBayesModel

# Assumes an active SparkSession `spark` and SparkContext `sc`, as in the pyspark shell.
df = spark.createDataFrame([
    Row(label=0.0, weight=0.1, features=Vectors.dense([0.0, 0.0])),
    Row(label=0.0, weight=0.5, features=Vectors.dense([0.0, 1.0])),
    Row(label=1.0, weight=1.0, features=Vectors.dense([1.0, 0.0]))])
nb = NaiveBayes(smoothing=1.0, modelType="multinomial", weightCol="weight")
model = nb.fit(df)
model.pi
#DenseVector([-0.81..., -0.58...])
model.theta
#DenseMatrix(2, 2, [-0.91..., -0.51..., -0.40..., -1.09...], 1)
test0 = sc.parallelize([Row(features=Vectors.dense([1.0, 0.0]))]).toDF()
result = model.transform(test0).head()
result.prediction
#1.0
result.probability
#DenseVector([0.32..., 0.67...])
result.rawPrediction
#DenseVector([-1.72..., -0.99...])
test1 = sc.parallelize([Row(features=Vectors.sparse(2, [0], [1.0]))]).toDF()
model.transform(test1).head().prediction
#1.0

# Save and reload the estimator and the fitted model.
# temp_path can be any writable directory; a fresh temporary one is assumed here.
temp_path = tempfile.mkdtemp()
nb_path = temp_path + "/nb"
nb.save(nb_path)
nb2 = NaiveBayes.load(nb_path)
nb2.getSmoothing()
#1.0
model_path = temp_path + "/nb_model"
model.save(model_path)
model2 = NaiveBayesModel.load(model_path)
model.pi == model2.pi
#True
model.theta == model2.theta
#True

# Per-class thresholds change which class wins for the same probabilities.
nb = nb.setThresholds([0.01, 10.00])
model3 = nb.fit(df)
result = model3.transform(test0).head()
result.prediction
#0.0
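
The numbers printed above can be reproduced by hand, which helps explain what pi, theta and thresholds mean. The sketch below is plain Python and assumes weighted counts with additive smoothing for a multinomial model, and a prediction rule that picks the class with the largest probability divided by its threshold. The function names are illustrative; this is not Spark's implementation.

```python
import math


def fit_multinomial_nb(rows, num_classes, num_features, smoothing=1.0):
    """rows: (label, weight, features) triples. Returns (pi, theta) as logs."""
    class_weight = [0.0] * num_classes
    feature_sum = [[0.0] * num_features for _ in range(num_classes)]
    for label, weight, features in rows:
        c = int(label)
        class_weight[c] += weight
        for j, x in enumerate(features):
            feature_sum[c][j] += weight * x
    total = sum(class_weight)
    pi = [math.log((cw + smoothing) / (total + num_classes * smoothing))
          for cw in class_weight]
    theta = [[math.log((s + smoothing) / (sum(row) + num_features * smoothing))
              for s in row] for row in feature_sum]
    return pi, theta


def predict(pi, theta, features, thresholds=None):
    """Returns (rawPrediction, probability, prediction) for one feature vector."""
    raw = [p + sum(t * x for t, x in zip(row, features)) for p, row in zip(pi, theta)]
    top = max(raw)
    exps = [math.exp(r - top) for r in raw]  # subtract the max for numerical stability
    prob = [e / sum(exps) for e in exps]
    scaled = prob if thresholds is None else [p / t for p, t in zip(prob, thresholds)]
    return raw, prob, float(scaled.index(max(scaled)))


# Same training rows as the DataFrame df above.
rows = [(0.0, 0.1, [0.0, 0.0]), (0.0, 0.5, [0.0, 1.0]), (1.0, 1.0, [1.0, 0.0])]
pi, theta = fit_multinomial_nb(rows, num_classes=2, num_features=2)
raw, prob, pred = predict(pi, theta, [1.0, 0.0])
_, _, pred_thresholded = predict(pi, theta, [1.0, 0.0], thresholds=[0.01, 10.00])
print(pi, theta)
print(raw, prob, pred, pred_thresholded)
```

With thresholds [0.01, 10.00] the scaled scores are roughly 32.4 for class 0 and 0.07 for class 1, so the prediction flips to 0.0 even though class 1 is still the more probable one.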
