Spark (Xiamen University) ---- Basic Statistics Tools (2)

阿新 • Published: 2019-01-15

Source: http://mocom.xmu.edu.cn/article/show/584d1fc5bd8177b41ebbd8bc/0/1

5. Hypothesis testing

Spark currently supports Pearson's chi-squared tests, covering both the goodness-of-fit test and the independence test.

First, import the required packages:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.regression.LabeledPoint 

import org.apache.spark.mllib.stat.Statistics

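Next, we pick the data to analyze from the dataset. As an example, we take the first two records of the iris dataset, v1 and v2. The input type determines which test is run: a goodness-of-fit test requires a Vector as input, while an independence test requires a Matrix.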

scala> val v1: Vector = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => Vectors.dense(p(0).toDouble, p(1).toDouble, p(2).toDouble, p(3).toDouble)).first
v1: org.apache.spark.mllib.linalg.Vector = [5.1,3.5,1.4,0.2]
scala> val v2: Vector = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => Vectors.dense(p(0).toDouble, p(1).toDouble, p(2).toDouble, p(3).toDouble)).take(2).last
v2: org.apache.spark.mllib.linalg.Vector = [4.9,3.0,1.4,0.2]

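(1) Goodness of fit

The goodness-of-fit test checks whether the frequency distribution of a set of observations differs from a theoretical distribution. Its null hypothesis H0 is that the observed event counts in a sample follow a specified theoretical distribution. The observed counts from a multinomial experiment are compared with the counts expected under the null hypothesis to see how close the two are. In short, it is a statistical method that uses sample data to test whether the population follows a particular distribution.

When no expected distribution is supplied, Spark tests against the uniform distribution. The code is as follows: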

scala> val goodnessOfFitTestResult = Statistics.chiSqTest(v1)
goodnessOfFitTestResult: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 3
statistic = 5.588235294117647
pValue = 0.1334553914430291
No presumption against null hypothesis: observed follows the same distribution as expected..

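The output shows the p-value, the degrees of freedom, the test statistic, the method used, and the null hypothesis. Here is what each field means:

method: the method used. Here it is pearson.

statistic: the test statistic. Put simply, it is the evidence used to decide whether the null hypothesis can be rejected. It is computed from the sample data and summarizes the information in the sample. The larger the statistic, the stronger the grounds for rejecting the null hypothesis; the smaller it is, the weaker those grounds.

degrees of freedom: the number of sample observations that are free to vary. For a goodness-of-fit test over k categories this is k - 1, which gives 3 for the four values here.

pValue: the p-value produced by the significance test. Conventionally P < 0.05 is treated as significant and P < 0.01 as highly significant. It means that, if the null hypothesis were true, the probability of seeing a statistic at least this extreme would be below 0.05 or 0.01.

In general, the p-value is the main thing to look at in a hypothesis test. In this example pValue = 0.133, so the difference is not significant: from the observations v1 = [5.1, 3.5, 1.4, 0.2] we cannot reject the hypothesis that they follow the expected distribution (uniform by default).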
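To make the statistic and degrees of freedom above concrete, here is a minimal sketch in plain Scala (no Spark needed) that recomputes them for v1 under a uniform expectation. The helper name chiSqUniform is made up for illustration and is not part of the Spark API.

```scala
// Pearson chi-squared statistic against a uniform expectation.
// chiSqUniform is an illustrative helper, not a Spark function.
def chiSqUniform(observed: Array[Double]): (Double, Int) = {
  // Under the uniform hypothesis every cell has the same expected value.
  val expected = observed.sum / observed.length
  // Sum of (observed - expected)^2 / expected over all cells.
  val statistic = observed.map(o => (o - expected) * (o - expected) / expected).sum
  // k categories leave k - 1 degrees of freedom.
  val degreesOfFreedom = observed.length - 1
  (statistic, degreesOfFreedom)
}

val (gofStat, gofDf) = chiSqUniform(Array(5.1, 3.5, 1.4, 0.2))
println(s"statistic = $gofStat, degrees of freedom = $gofDf")
```

The values should agree with the statistic (about 5.588) and the 3 degrees of freedom that Spark reported above.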

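(2) Independence

The chi-squared independence test checks whether two attributes are independent. One attribute forms the rows and the other forms the columns, and the test examines whether an apparent association between them really exists, for example between temperature changes and the incidence of pneumonia.

First, we build a Matrix from v1 and v2, then run the independence test: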

scala> val mat: Matrix =Matrices.dense(2,2,Array(v1(0),v1(1),v2(0),v2(1)))
mat: org.apache.spark.mllib.linalg.Matrix =
5.1  4.9
3.5  3.0
scala> val a =Statistics.chiSqTest(mat)
a: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 1
statistic = 0.012787584067389817
pValue = 0.90996538641943
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..

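The two attributes tested for independence here are the sample index (the columns, v1 or v2) and the data values taken from each sample (the rows). In this example pValue = 0.91, so we cannot reject the hypothesis that the sample index and the data values are unrelated. This matches the dataset itself: v1 and v2 are two records drawn from the same dataset, so which record we look at should have no bearing on the values.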
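As a rough check on the result above, the independence statistic can be recomputed by hand from the 2 x 2 table. This is a sketch in plain Scala; chiSqIndependence is a hypothetical helper name, and the expected value of each cell is taken as rowSum * colSum / total, the usual Pearson formula.

```scala
// Pearson chi-squared independence statistic for a contingency table.
// chiSqIndependence is an illustrative helper, not a Spark function.
def chiSqIndependence(table: Array[Array[Double]]): (Double, Int) = {
  val rowSums = table.map(_.sum)
  val colSums = table.transpose.map(_.sum)
  val total = rowSums.sum
  val cellTerms = for {
    i <- table.indices
    j <- table(i).indices
  } yield {
    // Expected cell value if rows and columns were independent.
    val expected = rowSums(i) * colSums(j) / total
    val diff = table(i)(j) - expected
    diff * diff / expected
  }
  val degreesOfFreedom = (rowSums.length - 1) * (colSums.length - 1)
  (cellTerms.sum, degreesOfFreedom)
}

// Same layout as mat above: first row (5.1, 4.9), second row (3.5, 3.0).
val (indStat, indDf) = chiSqIndependence(Array(Array(5.1, 4.9), Array(3.5, 3.0)))
println(s"statistic = $indStat, degrees of freedom = $indDf")
```

The result should be close to the statistic of about 0.0128 with 1 degree of freedom shown in the Spark output.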

We can also treat v1 as the sample and v2 as the expected values, and run a chi-squared test:

scala> val c1 = Statistics.chiSqTest(v1, v2)
c1: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 3
statistic = 0.03717820461517941
pValue = 0.9981145601231336
No presumption against null hypothesis: observed follows the same distribution as expected..

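In this example pValue = 0.998, so the sample v1 does not differ significantly from a distribution whose expected values are v2. Indeed, v1 = [5.1,3.5,1.4,0.2] and v2 = [4.9,3.0,1.4,0.2] are very similar, and v1 can be regarded as a sample drawn from a distribution whose expected values are v2.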
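The sums of v1 (10.2) and v2 (9.5) differ, and Spark's documentation for chiSqTest(observed, expected) notes that expected is rescaled when its sum differs from the observed sum. The sketch below reproduces the statistic under that assumption in plain Scala; chiSqAgainstExpected is a made-up helper name.

```scala
// Chi-squared statistic of observed against expected, rescaling expected
// so both vectors have the same total. Illustrative helper, not Spark API.
def chiSqAgainstExpected(observed: Array[Double], expected: Array[Double]): Double = {
  require(observed.length == expected.length, "vectors must have the same length")
  val scale = observed.sum / expected.sum
  observed.zip(expected).map { case (o, e) =>
    val scaled = e * scale               // expected value after rescaling
    (o - scaled) * (o - scaled) / scaled
  }.sum
}

val fitStat = chiSqAgainstExpected(Array(5.1, 3.5, 1.4, 0.2), Array(4.9, 3.0, 1.4, 0.2))
println(s"statistic = $fitStat")
```

The result should be close to the value of about 0.0372 reported above, which suggests the rescaling is what produces that number.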

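Similarly, an independence test can be run on (label, features) pairs. Here we build such pairs from the iris data as LabeledPoint objects: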

scala> val data=sc.textFile("G:/spark/iris.data")
data: org.apache.spark.rdd.RDD[String] = G:/spark/iris.data MapPartitionsRDD[13] at textFile at :44
scala> val obs = data.map { line =>
     |   val parts = line.split(',')
     |   val label = if (parts(4) == "Iris-setosa") 0.0 else if (parts(4) == "Iris-versicolor") 1.0 else 2.0
     |   LabeledPoint(label, Vectors.dense(parts(0).toDouble, parts(1).toDouble, parts(2).toDouble, parts(3).toDouble))
     | }
obs: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[14] at map at :46

Run the independence test. It returns an array holding the chi-squared test result of each feature against the label:

scala> val featureTestResults= Statistics.chiSqTest(obs)
featureTestResults: Array[org.apache.spark.mllib.stat.test.ChiSqTestResult] =
Array(Chi squared test summary:
method: pearson
degrees of freedom = 68
statistic = 156.26666666666665
pValue = 6.6659873176888595E-9
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 44
statistic = 88.36446886446883
pValue = 8.303947787857702E-5
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 84
statistic = 271.79999999999995
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent.., Chi...

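What happens here is that each column of the feature data is tested for independence against the label. The p-values are all very small, so we can reject the hypothesis that a given column is unrelated to the label column. In other words, every column can be considered associated with the final label. We print the full results with foreach: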

scala> var i = 1
i: Int = 1
scala> featureTestResults.foreach { result =>
     |   println(s"Column $i:\n$result")
     |   i += 1
     | }
Column 1:
Chi squared test summary:
method: pearson
degrees of freedom = 68
statistic = 156.26666666666665
pValue = 6.6659873176888595E-9
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent..
Column 2:
Chi squared test summary:
method: pearson
degrees of freedom = 44
statistic = 88.36446886446883
pValue = 8.303947787857702E-5
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent..
Column 3:
Chi squared test summary:
method: pearson
degrees of freedom = 84
statistic = 271.79999999999995
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent..
Column 4:
Chi squared test summary:
method: pearson
degrees of freedom = 42
statistic = 271.75
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent..
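The degrees of freedom in these results follow from treating every distinct feature value as its own category. The sketch below shows that bookkeeping on a tiny made-up dataset in plain Scala; the pairs are hypothetical and are not taken from iris.

```scala
// (label, featureValue) pairs. The data is made up for illustration.
val pairs = Seq((0.0, 5.1), (0.0, 4.9), (1.0, 7.0), (1.0, 6.4), (2.0, 6.3), (2.0, 5.1))

// Contingency table: how often each (featureValue, label) combination occurs.
val counts: Map[(Double, Double), Int] =
  pairs.groupBy { case (label, value) => (value, label) }
    .map { case (key, group) => key -> group.size }

// Each distinct feature value is a row, each distinct label is a column.
val distinctValues = pairs.map(_._2).distinct.size
val distinctLabels = pairs.map(_._1).distinct.size
val degreesOfFreedom = (distinctValues - 1) * (distinctLabels - 1)

println(s"rows = $distinctValues, columns = $distinctLabels, df = $degreesOfFreedom")
```

By the same formula, the 68 degrees of freedom reported for column 1 with three labels are consistent with 35 distinct values in that column, since (35 - 1) * (3 - 1) = 68.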

Spark also supports the Kolmogorov-Smirnov test. The steps are shown below:

scala> val test = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => p(0).toDouble)
test: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[22] at map at :44
// run a KS test for the sample versus a standard normal distribution
scala> val testResult = Statistics.kolmogorovSmirnovTest(test, "norm", 0, 1)
testResult: org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult =
Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.999991460094529
pValue = 0.0
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
// perform a KS test using a cumulative distribution function of our making
scala>     val myCDF: Double => Double = (p=>p*2)
myCDF: Double => Double = <function1>
scala>     val testResult2 = Statistics.kolmogorovSmirnovTest(test, myCDF)
testResult2: org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult = Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 14.806666666666668
pValue = 0.0
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
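For intuition about the statistic, here is a minimal sketch in plain Scala of the one-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical CDF of the sample and a theoretical CDF. The helper ksStatistic, the three-point sample, and the uniform CDF are assumptions chosen for illustration. A function passed as a CDF should return values in [0, 1], which is why the uniform CDF is clamped.

```scala
// One-sample KS statistic: max distance between empirical and theoretical CDF.
// ksStatistic is an illustrative helper, not a Spark function.
def ksStatistic(sample: Seq[Double], cdf: Double => Double): Double = {
  val sorted = sample.sorted
  val n = sorted.length.toDouble
  sorted.zipWithIndex.map { case (x, i) =>
    val f = cdf(x)
    // The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
    math.max((i + 1) / n - f, f - i / n)
  }.max
}

// CDF of the uniform distribution on [0, 1], clamped to stay within [0, 1].
val uniformCdf: Double => Double = x => math.min(1.0, math.max(0.0, x))

val ksD = ksStatistic(Seq(0.1, 0.4, 0.7), uniformCdf)
println(s"KS statistic = $ksD")
```

For this small sample the largest gap is at x = 0.7, where the empirical CDF reaches 1.0 and the uniform CDF is 0.7, giving a statistic of about 0.3.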

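6. Random data generation

RandomRDDs is a set of utilities for generating RDDs of random numbers that follow a given distribution. The RandomRDDs package currently supports three distributions: normal, Poisson, and uniform. It can produce random double RDDs or vector RDDs.

The example below generates a random double RDD whose values follow the standard normal distribution N(0, 1), then maps it to N(1, 4).

First, import the required packages: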

import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._

Generate an RDD[Double] of 10,000,000 values following the normal distribution N(0, 1), spread across 10 partitions:

scala> val u = normalRDD(sc, 10000000L, 10)
u: org.apache.spark.rdd.RDD[Double] = RandomRDD[35] at RDD at RandomRDD.scala:38

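Transform the generated random numbers to the normal distribution N(1, 4). If x follows N(0, 1), then 1 + 2x has mean 1 and variance 2² = 4: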

scala> val v = u.map(x => 1.0 + 2.0 * x)
v: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[36] at map at :50
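The same transformation can be checked locally without Spark. The sketch below uses scala.util.Random in place of normalRDD, with an arbitrary seed and sample size chosen for illustration, and confirms that the mapped values have mean near 1 and variance near 4.

```scala
import scala.util.Random

// Local stand-in for normalRDD: draw standard normal values with a fixed seed.
val rng = new Random(42L)
val sampleSize = 100000
val standard = Array.fill(sampleSize)(rng.nextGaussian())

// Same mapping as above: N(0, 1) -> N(1, 4), i.e. mean 1 and standard deviation 2.
val shifted = standard.map(x => 1.0 + 2.0 * x)

val sampleMean = shifted.sum / sampleSize
val sampleVariance = shifted.map(x => (x - sampleMean) * (x - sampleMean)).sum / sampleSize

println(s"mean = $sampleMean, variance = $sampleVariance")
```

Note that the second parameter of N(1, 4) is the variance, so the multiplier in the mapping is the standard deviation 2.0, not 4.0.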

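7. Kernel density estimation

Spark MLlib provides a utility class, KernelDensity, for kernel density estimation. Kernel density estimation estimates an unknown density from known samples and is a nonparametric method. The idea is as follows: given the observed data, if a value appears among the observations, the probability density at that value is taken to be high; values close to it also get a relatively high density, while values far from it get a lower density. Spark 1.6.2 supports the Gaussian kernel.

First, import the required packages: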

import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

Recall the data loaded earlier:

scala> val test = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => p(0).toDouble)
test: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[22] at map at :44

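Build the kernel density estimator from the sample data. Here the sample is the first attribute of iris, the same data used in the hypothesis-testing section: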


scala> val kd = new KernelDensity().setSample(test).setBandwidth(3.0)
kd: org.apache.spark.mllib.stat.KernelDensity = org.apache.spark.mllib.stat.KernelDensity@26216fa3

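Here setBandwidth sets the width of the Gaussian kernel. It is a smoothing parameter and can be viewed as the standard deviation of the Gaussian kernel.

With the kernel density estimator kd constructed, we can estimate the density at given points: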

scala> val densities = kd.estimate(Array(-1.0, 2.0, 5.0, 5.8))
densities: Array[Double] = Array(0.011372003554433524, 0.059925911357198915, 0.12365409462424519, 0.12816280708978114)

This means that at the points -1.0, 2.0, 5.0, and 5.8 the estimated values of the probability density function are, respectively: 0.011372003554433524, 0.059925911357198915, 0.12365409462424519, 0.12816280708978114.
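To show what a Gaussian kernel estimate computes, here is a minimal sketch in plain Scala: the density at x is the average of Gaussian bumps centered on each sample point, with the bandwidth as their standard deviation. The helper gaussianKde and the three-point sample are assumptions for illustration. This is not Spark's implementation, so the numbers will not match the output above.

```scala
// Gaussian kernel density estimate at point x.
// gaussianKde is an illustrative helper, not the Spark KernelDensity class.
def gaussianKde(sample: Seq[Double], bandwidth: Double)(x: Double): Double = {
  // Normalizing constant of a Gaussian with standard deviation = bandwidth.
  val norm = 1.0 / (bandwidth * math.sqrt(2 * math.Pi))
  sample.map { xi =>
    val z = (x - xi) / bandwidth
    norm * math.exp(-0.5 * z * z)
  }.sum / sample.length
}

// Hypothetical three-value sample, with the same bandwidth as in the text.
val estimate: Double => Double = gaussianKde(Seq(4.9, 5.1, 5.8), 3.0)
val kdeDensities = Seq(-1.0, 2.0, 5.0, 5.8).map(estimate)
println(kdeDensities.mkString(", "))
```

Points near the sample values (5.0 and 5.8) receive a higher density than points far from them (-1.0), which is the behavior described above.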
