Probability and Statistics Algorithms Provided by Spark

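spark-mllib uses the Breeze linear algebra package. Breeze relies on netlib-java to optimize numerical processing, but because of licensing issues Breeze does not include the netlib-java native proxies by default, so you need to add them to your project yourself.

compile "org.apache.spark:spark-mllib_${scalaMajorVersion}:${sparkVersion}"
compile "com.github.fommil.netlib:all:1.1.2"

The following are several basic algorithms provided by spark-mllib.

1. Summary statistics

For a data set held as an RDD[Vector], summary statistics include the column-wise maximum (max), minimum (min), mean, variance, number of nonzeros (numNonzeros), and the total count. They are exposed through the MultivariateStatisticalSummary class:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD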


val observations: RDD[Vector] = ... // an RDD of Vectors

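// Compute column summary statistics.
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column

2. Correlations

Correlation measures the relationship between two series of data. Spark currently supports two kinds of correlation analysis, Pearson and Spearman, and provides both through the Statistics class. Two input types are supported, RDD[Double] and RDD[Vector]; the corresponding outputs are a Double and a Matrix.

The Pearson correlation coefficient measures how strongly two data sets are linearly related. Statistics classifies data by scale of measurement into four types: interval scale, ordinal scale, nominal scale and ratio scale. Pearson correlation is computed on interval-scale variables. The larger the absolute value of the coefficient, the stronger the correlation: the closer the coefficient is to 1 or -1, the stronger the correlation, and the closer it is to 0, the weaker the correlation. The absolute value is commonly interpreted as follows: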

0.8-1.0 very strong correlation
0.6-0.8 strong correlation
0.4-0.6 moderate correlation
0.2-0.4 weak correlation
0.0-0.2 very weak or no correlation
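To make the coefficient and the bands above concrete, here is a minimal plain-Scala sketch that computes Pearson's r for two small in-memory series and maps its absolute value to a band. The object and method names (PearsonSketch, pearson, strength) are hypothetical and are not part of Spark. The table leaves its boundaries ambiguous (0.8 appears in two rows), so this sketch assumes a boundary value belongs to the stronger band.

```scala
object PearsonSketch {
  /** Pearson correlation coefficient of two equally sized, non-constant series. */
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.nonEmpty, "series must be non-empty and of equal length")
    val meanX = xs.sum / xs.size
    val meanY = ys.sum / ys.size
    // Sum of products of deviations from the means (unnormalized covariance).
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    // Sums of squared deviations (unnormalized variances).
    val varX = xs.map(x => math.pow(x - meanX, 2)).sum
    val varY = ys.map(y => math.pow(y - meanY, 2)).sum
    cov / math.sqrt(varX * varY)
  }

  /** Maps |r| to a descriptive band; boundary values go to the stronger band (an assumption). */
  def strength(r: Double): String = math.abs(r) match {
    case a if a >= 0.8 => "very strong"
    case a if a >= 0.6 => "strong"
    case a if a >= 0.4 => "moderate"
    case a if a >= 0.2 => "weak"
    case _             => "very weak or none"
  }
}
```

For example, PearsonSketch.pearson(Seq(1.0, 2.0, 3.0, 4.0), Seq(2.0, 4.0, 6.0, 8.0)) describes a perfectly linear pair, so the result falls in the "very strong" band. On distributed data, Statistics.corr computes the same quantity over RDDs.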

For an introduction to the Spearman correlation coefficient, see:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD


val sc: SparkContext = ...

val seriesX: RDD[Double] = ... // a series
val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX

// Compute the correlation using Pearson's method. Enter "spearman" for Spearman's method.
// If a method is not specified, Pearson's method will be used by default.
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")


val data: RDD[Vector] = ... // note that each Vector is a row and not a column

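// Calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.
// If a method is not specified, Pearson's method will be used by default.
val correlMatrix: Matrix = Statistics.corr(data, "pearson")

3. Stratified sampling

Stratified sampling is built directly into key-value RDDs, RDD[(K, V)], through the sampleByKey and sampleByKeyExact methods, so no additional spark.mllib support is needed. As the name suggests, stratified sampling divides the data into groups (strata) according to some feature, then draws samples from each group under given conditions and combines them into a new data set. sampleByKey therefore operates on a key-value RDD, where the key identifies the stratum and the value can be anything. The fractions parameter defines the strata and their sampling rates: it is a Map[K, Double] whose keys are the stratum keys and whose Double values are the sampling fraction for each key, with 1.0 meaning 100%.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.PairRDDFunctions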


val sc: SparkContext = ...

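val data = ... // an RDD[(K, V)] of any key value pairs
val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key

// Get an approximate sample and an exact sample from each stratum
val approxSample = data.sampleByKey(withReplacement = false, fractions)
val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)

// Suppose we have the following list; the first number is the userId and the rest is some business data
val list = List(
  (2147481832, 23355149, 1),
  (2147481832, 97301062, 1),
  (2147481832, 21348702, 1),
  (2147481832, 54102337, 1),
  (2147481832, 16822060, 1),
  (2147481832, 11382119, 1),
  (2147481832, 85220256, 1),
  (2147481832, 20137598, 1),
  (2147481832, 48653889, 1),
  (2147481832, 91918798, 1),
  ...)

// Convert it to key-value pairs, with userId as the key and the remaining data as the value
val data = sc.parallelize(list.toSeq).map(x => (x._1, (x._2, x._3)))
// Extract the distinct userIds as the keys of fractions, assuming every user is sampled with probability 80%
val fractions = data.map(_._1).distinct.map(x => (x, 0.8)).collectAsMap
// Get the sampled data
val sampleData = data.sampleByKey(withReplacement = false, fractions)

The difference between sampleByKey and sampleByKeyExact is that sampleByKey does not make additional passes over the full data set, so the sample sizes it produces are only approximate, whereas sampleByKeyExact does process the full data to determine the sample. It therefore consumes considerably more computing resources, but the result is more accurate.

4. Hypothesis testing

Spark supports Pearson's chi-squared test through the Statistics class. The test is mainly used to compare the rates (proportions) of two or more samples and to analyze the association between two categorical variables. Its basic idea is to measure how well the theoretical frequencies agree with the observed frequencies, that is, the goodness of fit. The chi-squared test has two uses: the goodness-of-fit test and the independence test.

Goodness of fit: the observed values obtained from repeated trials are compared with the expected values under the hypothesis to see how far apart they are. This is the chi-squared goodness-of-fit test, which checks how close the two are. Rolling a die is an example.

Independence: the chi-squared independence test checks whether two attributes are independent. One attribute is laid out as rows and the other as columns, and the test examines whether an apparent relationship between them really exists. An example is temperature change and the incidence of pneumonia.

The basic approach of hypothesis testing is as follows. First assume a conclusion, then set the expected values under that conclusion, compare the observed values against them, and set a threshold. If the computed result is greater than the threshold, the hypothesis is rejected; otherwise it is accepted. From this description, four values need to be determined:

1) Conclusion: the conclusion is generally based on the null hypothesis, which holds that the difference between the observed and theoretical values is due to random error. For example: "every outcome of rolling a die has the same probability". The premise of this conclusion is that any differences are caused by random factors.
2) Expected values: the expected values are the theoretical values, which can be some kind of average. For example, if we roll a die 120 times, then for the conclusion to hold the theoretical count for each face should be 20.
3) Observed values: the values actually obtained.
4) Threshold: the threshold (critical value) is computed from the degrees of freedom and the significance level (the chiinv() function in Excel).

Degrees of freedom = (number of outcome categories - 1) x (number of groups compared - 1). For example, if we compare two groups of die rolls, the degrees of freedom are (6-1)x(2-1)=5. The significance level (α) is the probability of rejecting the null hypothesis when it is in fact true. It is chosen according to the risk one is willing to accept for the rejection region, and is usually 0.05 or 0.01.

The next step is to compute the chi-squared value, which is the sum over all groups of (observed value - theoretical value)^2 / theoretical value. Finally, compare the chi-squared value with the threshold: if it is smaller than the threshold, accept the conclusion; otherwise reject it. Alternatively, compute the p-value from the chi-squared value (the chidist() function in Excel) and compare it with the significance level: if the p-value is smaller, reject; if it is larger, accept.

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD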


val sc: SparkContext = ...

/**************************************************/
val vec: Vector = ... // a vector composed of the frequencies of events
val expected: Vector = ... // a vector composed of the expected frequencies of events

// Compute the goodness of fit against the expected frequencies.
val goodnessOfFitForExpectedTestResult = Statistics.chiSqTest(vec, expected)
// Summary of the test including the p-value, degrees of freedom,
// test statistic, the method used, and the null hypothesis.
println(goodnessOfFitForExpectedTestResult)

// If a second vector to test against is not supplied as a parameter,
// the test runs against a uniform distribution.
// The uniform distribution is computed automatically, so no expected argument is needed;
// each expected proportion defaults to 1.0 / size.
val goodnessOfFitTestResult = Statistics.chiSqTest(vec)
println(goodnessOfFitTestResult)

/**************************************************/
val mat: Matrix = ... // a contingency matrix; its elements are Doubles

// Conduct Pearson's independence test on the input contingency matrix
val independenceTestResult = Statistics.chiSqTest(mat)

println(independenceTestResult) // summary of the test including the p-value, degrees of freedom...
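The goodness-of-fit procedure described in this section can be worked through by hand in plain Scala. The sketch below uses made-up counts for 120 die rolls (illustrative data, not from the article) and hypothetical names (GoodnessOfFitSketch, chiSquared). It assumes a single sample compared against expected counts, in which case the degrees of freedom are the number of categories minus 1. The critical value 11.070 is the usual chi-squared table entry for 5 degrees of freedom at significance level 0.05.

```scala
object GoodnessOfFitSketch {
  /** Chi-squared statistic: sum over categories of (observed - expected)^2 / expected. */
  def chiSquared(observed: Seq[Double], expected: Seq[Double]): Double = {
    require(observed.size == expected.size, "observed and expected must have the same length")
    observed.zip(expected).map { case (o, e) => math.pow(o - e, 2) / e }.sum
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical counts for 120 rolls of a die (illustrative data only).
    val observed = Seq(25.0, 17.0, 15.0, 23.0, 24.0, 16.0)
    // Under the null hypothesis of a fair die, each face is expected 120 / 6 = 20 times.
    val expected = Seq.fill(6)(observed.sum / 6)
    val statistic = chiSquared(observed, expected)
    // One sample against expected counts: degrees of freedom = categories - 1.
    val degreesOfFreedom = observed.size - 1
    // Table value for 5 degrees of freedom at significance level 0.05.
    val criticalValue = 11.070
    val verdict = if (statistic > criticalValue) "reject" else "do not reject"
    println(s"statistic=$statistic, df=$degreesOfFreedom: $verdict the null hypothesis")
  }
}
```

With these counts the squared deviations sum to 100, so the statistic is about 100 / 20 = 5.0, which is below the critical value, and the fair-die hypothesis is not rejected. Statistics.chiSqTest(vec) reports the same statistic together with a p-value.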

/**************************************************/

val obs: RDD[LabeledPoint] = ... // (feature, label) pairs.

// The contingency table is constructed from the raw (feature, label) pairs and used to conduct
// the independence test. Returns an array containing the ChiSqTestResult for every feature
// against the label.
val featureTestResults: Array[org.apache.spark.mllib.stat.test.ChiSqTestResult] = Statistics.chiSqTest(obs)
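To illustrate what "constructing a contingency table from (feature, label) pairs" involves, here is a plain-Scala sketch for a single categorical feature. It is not Spark's implementation. The names (IndependenceSketch, contingency, chiSquared) are hypothetical, and the expected count of each cell is assumed to be rowTotal * columnTotal / total, the standard formula under independence.

```scala
object IndependenceSketch {
  /** Counts how often each (feature, label) combination occurs. */
  def contingency[F, L](pairs: Seq[(F, L)]): Map[(F, L), Int] =
    pairs.groupBy(identity).map { case (k, v) => k -> v.size }

  /** Returns the chi-squared statistic and the degrees of freedom for a table of counts. */
  def chiSquared[F, L](table: Map[(F, L), Int]): (Double, Int) = {
    val features = table.keys.map(_._1).toSeq.distinct
    val labels = table.keys.map(_._2).toSeq.distinct
    val total = table.values.sum.toDouble
    // Combinations that never occur count as zero.
    def count(f: F, l: L): Double = table.getOrElse((f, l), 0).toDouble
    val rowTotals = features.map(f => f -> labels.map(l => count(f, l)).sum).toMap
    val colTotals = labels.map(l => l -> features.map(f => count(f, l)).sum).toMap
    val statistic = (for (f <- features; l <- labels) yield {
      // Expected count under independence: rowTotal * colTotal / total.
      val expected = rowTotals(f) * colTotals(l) / total
      math.pow(count(f, l) - expected, 2) / expected
    }).sum
    // Degrees of freedom: (rows - 1) x (columns - 1).
    (statistic, (features.size - 1) * (labels.size - 1))
  }
}
```

A usage example would be IndependenceSketch.chiSquared(IndependenceSketch.contingency(pairs)) for some Seq of (feature, label) pairs. The resulting statistic is then compared with the critical value for the returned degrees of freedom, as described in the section above.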
