Implementing Normality Tests for Data in Python

阿新 • Published: 2020-04-20

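In data analysis and statistics we often need to check whether data are normally distributed, because many procedures assume normality, for example the t-test.

Python offers the following main ways to test for normality:

1. scipy.stats.shapiro (Shapiro-Wilk test): a function dedicated to normality testing. Its null hypothesis is that the sample data follow a normal distribution.

Note: it is suited to small samples.

Its function definition is: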

def shapiro(x):
  """
  Perform the Shapiro-Wilk test for normality.

  The Shapiro-Wilk test tests the null hypothesis that the
  data was drawn from a normal distribution.

  Parameters
  ----------
  x : array_like
    Array of sample data.

  Returns
  -------
  W : float
    The test statistic.
  p-value : float
    The p-value for the hypothesis test.

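The parameter x is the sequence of sample values. The first return value is the test statistic and the second is the p-value. When the p-value is greater than the chosen significance level, the null hypothesis is not rejected.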
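A minimal sketch of how the two return values might be read in practice. The sample, the seed and the 0.05 significance level are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical small sample: 30 draws from N(5, 1.5^2); the seed is arbitrary.
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=1.5, size=30)

alpha = 0.05  # assumed significance level
w_stat, p_value = stats.shapiro(sample)

# A p-value above alpha means normality cannot be rejected;
# it does not prove that the data are normal.
if p_value > alpha:
    print(f"W={w_stat:.4f}, p={p_value:.4f}: cannot reject normality")
else:
    print(f"W={w_stat:.4f}, p={p_value:.4f}: reject normality")
```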

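2. scipy.stats.kstest (Kolmogorov-Smirnov test): it can test against many distributions, not only the normal. Its null hypothesis is that the data follow the specified distribution (the normal distribution when cdf='norm').

Its function definition is: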

def kstest(rvs,cdf,args=(),N=20,alternative='two-sided',mode='approx'):
  """
  Perform the Kolmogorov-Smirnov test for goodness of fit.

  This performs a test of the distribution G(x) of an observed
  random variable against a given distribution F(x). Under the null
  hypothesis the two distributions are identical,G(x)=F(x). The
  alternative hypothesis can be either 'two-sided' (default),'less'
  or 'greater'. The KS test is only valid for continuous distributions.

  Parameters
  ----------
  rvs : str,array or callable
    If a string,it should be the name of a distribution in `scipy.stats`.
    If an array,it should be a 1-D array of observations of random
    variables.
    If a callable,it should be a function to generate random variables;
    it is required to have a keyword argument `size`.
  cdf : str or callable
    If a string,it should be the name of a distribution in `scipy.stats`.
    If `rvs` is a string then `cdf` can be False or the same as `rvs`.
    If a callable,that callable is used to calculate the cdf.
  args : tuple,sequence,optional
    Distribution parameters,used if `rvs` or `cdf` are strings.
  N : int,optional
    Sample size if `rvs` is string or callable. Default is 20.
  alternative : {'two-sided','less','greater'},optional
    Defines the alternative hypothesis (see explanation above).
    Default is 'two-sided'.
  mode : 'approx' (default) or 'asymp',optional
    Defines the distribution used for calculating the p-value.

     - 'approx' : use approximation to exact distribution of test statistic
     - 'asymp' : use asymptotic distribution of test statistic

  Returns
  -------
  statistic : float
    KS test statistic,either D,D+ or D-.
  pvalue : float
    One-tailed or two-tailed p-value.

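The parameters are:

rvs: the data to be tested.

cdf: the distribution to test against, for example 'norm', 'expon', 'rayleigh' or 'gamma'. 'norm' means the normal distribution, which is the standard normal N(0, 1) unless args supplies the mean and standard deviation.

alternative: the default is a two-sided test; set it to 'less' or 'greater' for a one-sided test.

mode: 'approx' (the default) uses an approximation to the exact distribution of the test statistic; 'asymp' uses the asymptotic distribution of the test statistic.

The first return value is the statistic and the second is the p-value.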
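The role of cdf, args and alternative can be sketched as follows. The sample, the seed and the values 10.0 and 2.0 are assumptions for illustration. The sketch leaves the p-value calculation option at its default, since the name of that option has differed between SciPy versions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # arbitrary seed
sample = rng.normal(loc=10.0, scale=2.0, size=200)

# Against N(0, 1): location and scale do not match, so the p-value is tiny
# even though the data really are normal.
d_std, p_std = stats.kstest(sample, 'norm')

# Against N(10, 2^2): args passes the mean and standard deviation to 'norm'.
d_true, p_true = stats.kstest(sample, 'norm', args=(10.0, 2.0))

# One-sided variant of the same test.
d_less, p_less = stats.kstest(sample, 'norm', args=(10.0, 2.0),
                              alternative='less')

print(f"vs N(0, 1):            D={d_std:.4f}, p={p_std:.3g}")
print(f"vs N(10, 2^2):         D={d_true:.4f}, p={p_true:.3g}")
print(f"vs N(10, 2^2), 'less': D={d_less:.4f}, p={p_less:.3g}")
```

If the mean and standard deviation are estimated from the same sample instead of being known in advance, the p-value reported by the plain K-S test is only approximate and tends to be too large.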

3. scipy.stats.normaltest: a normality test whose null hypothesis is that the sample comes from a normal distribution.

Its function definition is:

def normaltest(a,axis=0,nan_policy='propagate'):
  """
  Test whether a sample differs from a normal distribution.

  This function tests the null hypothesis that a sample comes
  from a normal distribution. It is based on D'Agostino and
  Pearson's [1]_,[2]_ test that combines skew and kurtosis to
  produce an omnibus test of normality.


  Parameters
  ----------
  a : array_like
    The array containing the sample to be tested.
  axis : int or None,optional
    Axis along which to compute test. Default is 0. If None,compute over the whole array `a`.
  nan_policy : {'propagate','raise','omit'},optional
    Defines how to handle when input contains nan. 'propagate' returns nan,'raise' throws an error,'omit' performs the calculations ignoring nan
    values. Default is 'propagate'.

  Returns
  -------
  statistic : float or array
    ``s^2 + k^2``,where ``s`` is the z-score returned by `skewtest` and
    ``k`` is the z-score returned by `kurtosistest`.
  pvalue : float or array
    A 2-sided chi squared probability for the hypothesis test.

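Its parameters:

axis: the default is 0; axis=None runs the test over the whole array.

nan_policy: how to handle nan in the input. 'propagate' returns nan, 'raise' raises an error, and 'omit' ignores the nan values.

The first return value is the statistic and the second is the p-value.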
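A short sketch of how axis and nan_policy change the result. The array shape, the seed and the injected nan are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)  # arbitrary seed
data = rng.normal(size=(100, 3))  # 3 columns with 100 observations each

# axis=0 (default): one test per column, so the results are arrays of length 3.
stat_cols, p_cols = stats.normaltest(data)

# axis=None: the whole array is tested as a single sample.
stat_all, p_all = stats.normaltest(data, axis=None)

# Put one nan into a copy of the first column to compare the policies.
with_nan = data[:, 0].copy()
with_nan[0] = np.nan

stat_prop, p_prop = stats.normaltest(with_nan)  # 'propagate': result is nan
stat_omit, p_omit = stats.normaltest(with_nan, nan_policy='omit')  # nan dropped

raised = False
try:
    stats.normaltest(with_nan, nan_policy='raise')
except ValueError:
    raised = True  # 'raise' refuses input that contains nan

print(p_cols, p_all, p_prop, p_omit, raised)
```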

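4. scipy.stats.anderson (Anderson-Darling test): a refinement of the K-S test, used to test whether a sample comes from a given distribution (normal, exponential, logistic, or Gumbel).

Its function definition is: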

def anderson(x,dist='norm'):
  """
  Anderson-Darling test for data coming from a particular distribution

  The Anderson-Darling tests the null hypothesis that a sample is
  drawn from a population that follows a particular distribution.
  For the Anderson-Darling test,the critical values depend on
  which distribution is being tested against. This function works
  for normal,exponential,logistic,or Gumbel (Extreme Value
  Type I) distributions.

  Parameters
  ----------
  x : array_like
    array of sample data
  dist : {'norm','expon','logistic','gumbel','gumbel_l','gumbel_r','extreme1'},optional
    the type of distribution to test against. The default is 'norm'
    and 'extreme1','gumbel_l' and 'gumbel' are synonyms.

  Returns
  -------
  statistic : float
    The Anderson-Darling test statistic
  critical_values : list
    The critical values for this distribution
  significance_level : list
    The significance levels for the corresponding critical values
    in percents. The function returns critical values for a
    differing set of significance levels depending on the
    distribution that is being tested against.

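Its parameters:

x and dist are the sample data and the distribution to test against, respectively.

There are three return values: the first is the test statistic, the second is the list of critical values, and the third is the list of significance levels. Each critical value corresponds to one significance level.

The significance levels differ from distribution to distribution.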

Critical values provided are for the following significance levels:

  normal/exponential
    15%,10%,5%,2.5%,1%
  logistic
    25%,10%,5%,2.5%,1%,0.5%
  Gumbel
    25%,10%,5%,2.5%,1%

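Comparing the statistic with the critical values: when the statistic is larger than a critical value, the null hypothesis is rejected at the corresponding significance level, i.e. the data are judged not to come from the chosen distribution.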

If the returned statistic is larger than these critical values then for the corresponding significance level,the null hypothesis that the data come from the chosen distribution can be rejected.
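The comparison described above can be sketched as a loop over the paired critical values and significance levels. The exponential sample and the seed are assumptions, chosen so that normality is clearly violated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # arbitrary seed
sample = rng.exponential(scale=1.0, size=200)  # deliberately non-normal data

result = stats.anderson(sample, dist='norm')

# critical_values[i] belongs to significance_level[i] (given in percent).
verdicts = []
for crit, level in zip(result.critical_values, result.significance_level):
    rejected = result.statistic > crit
    verdicts.append(rejected)
    outcome = "reject" if rejected else "cannot reject"
    print(f"{level:>4}% level: critical value {crit:.3f} -> {outcome} normality")

print(f"statistic = {result.statistic:.3f}")
```

Unlike the other tests in this article, this comparison does not use a p-value: the decision is made separately at each tabulated significance level.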

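5. skewtest and kurtosistest: these test whether the sample's skewness and kurtosis match those of a normal distribution, which has skewness 0 and kurtosis 3.

Skewness: the standardized third central moment of the sample.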


Kurtosis: the standardized fourth central moment of the sample.
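Written out in the usual notation, with \mu the mean and \sigma the standard deviation of X (symbols introduced here for illustration), the two standardized moments are:

```latex
\[
\operatorname{Skew}(X) = \frac{\mathrm{E}\left[(X-\mu)^{3}\right]}{\sigma^{3}},
\qquad
\operatorname{Kurt}(X) = \frac{\mathrm{E}\left[(X-\mu)^{4}\right]}{\sigma^{4}}
\]
```

For a normal distribution these evaluate to 0 and 3 respectively.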

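A minimal sketch of the two tests on one roughly normal and one skewed sample. Both samples and the seed are assumptions for illustration. It also checks the relationship quoted in the normaltest docstring, namely that its statistic is the sum of the squared z-scores from skewtest and kurtosistest.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)  # arbitrary seed
normal_sample = rng.normal(size=500)
skewed_sample = rng.exponential(size=500)  # right-skewed, heavy right tail

for name, sample in [("normal", normal_sample), ("exponential", skewed_sample)]:
    skew_z, skew_p = stats.skewtest(sample)
    kurt_z, kurt_p = stats.kurtosistest(sample)
    # fisher=False reports kurtosis on the scale where a normal scores 3;
    # the default (fisher=True) reports excess kurtosis, i.e. kurtosis minus 3.
    skewness = stats.skew(sample)
    kurtosis = stats.kurtosis(sample, fisher=False)
    print(f"{name}: skew={skewness:.3f} (p={skew_p:.3g}), "
          f"kurtosis={kurtosis:.3f} (p={kurt_p:.3g})")

# normaltest combines the two z-scores into one omnibus statistic.
z_s, _ = stats.skewtest(normal_sample)
z_k, _ = stats.kurtosistest(normal_sample)
omnibus, _ = stats.normaltest(normal_sample)
print(np.isclose(omnibus, z_s**2 + z_k**2))
```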

6. The code is as follows:

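import numpy as np
from scipy import stats

# a: 50 draws from a normal distribution with mean 0 and standard deviation 2
a = np.random.normal(0, 2, 50)
# b: 100 evenly spaced points on [0, 10], i.e. clearly non-normal data
b = np.linspace(0, 10, 100)

# Shapiro-Wilk test
S, p = stats.shapiro(a)
print('the shapiro test result is:', S, ',', p)

# K-S test: 'norm' alone means N(0, 1), so pass the mean and std used for a
K, p = stats.kstest(a, 'norm', args=(0, 2))
print(K, p)

# normaltest (D'Agostino-Pearson omnibus test)
N, p = stats.normaltest(b)
print(N, p)

# Anderson-Darling test: no p-value is unpacked here; compare the statistic
# with the critical value at each significance level instead
result = stats.anderson(b, dist='norm')
print(result.statistic, result.critical_values, result.significance_level)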

That is all for this article. I hope it helps with your studies, and thank you for your support.
