[Python] 如何證明兩組樣本有顯著性差異(t-test假設檢驗)

阿新 • • 發佈：2019-01-24

現有兩組樣本資料，假如它們分別基於兩套不同的方法，或者測於不同的裝置，又或是出自兩個人之手，如何證明它們有或沒有顯著性差別呢？當然可以拿個Excel表把資料畫個圖，然後找一堆人來投票，看覺得差不多還是覺得差得多的人哪方票數高。但終歸這種做法有些主觀，不夠說明力。概率統計給出了一種更為客觀的基於統計的方法，這裡是一個Python的實現：

#!/usr/bin/python
# Paired difference hypothesis testing - Student's t-test
# Author:       Jin Zhuojun
# History:      Create      Tue May  8 16:12:21 CST 2012

import string
import math
import sys

from scipy.stats import  t
import matplotlib.pyplot as plt
import numpy as np

##############
# Parameters #
##############
ver = 1
verbose = 0
alpha = 0.05

def usage():
    print """
    usage: ./program data_file(one sample in one line)
    """

def main():

    ##########
    # Sample #
    ##########
    if (len(sys.argv) < 2):
        usage()
        sys.exit()

    # f = open('./data.txt')
    # f = open('./test.txt')
    f = open(sys.argv[1])

    try:
        sample1_text = f.readline()
        sample2_text = f.readline()
    finally:
        f.close()

    if (verbose):
        print("sample1 text: ", sample1_text)
        print("sample2 text: ", sample2_text)

    sample1_text_split = sample1_text.split()
    sample2_text_split = sample2_text.split()

    if (verbose):
        print(sample1_text_split)
        print(sample2_text_split)
        print("len = ", len(sample1_text_split))

    assert(len(sample1_text_split) == len(sample2_text_split))
    sample_len = len(sample1_text_split)

    sample1 = []
    sample2 = []

    for i in range(sample_len):
        sample1.append(string.atof(sample1_text_split[i]))
        sample2.append(string.atof(sample2_text_split[i]))

    sample_diff = []

    for i in range(sample_len):
        sample_diff.append(sample1[i] - sample2[i])

    if (verbose):
        print("sample_diff = ", sample_diff)


    ######################
    # Hypothesis testing #
    ######################
    sample = sample_diff

    numargs = t.numargs
    [ df ] = [sample_len - 1,] * numargs
    if (verbose):
        print("df(degree of freedom, student's t distribution parameter) = ", df)

    sample_mean = np.mean(sample)
    sample_std = np.std(sample, dtype=np.float64, ddof=1)
    if (verbose):
        print("mean = %f, std = %f" % (sample_mean, sample_std))

    abs_t = math.fabs( sample_mean / (sample_std / math.sqrt(sample_len)) )
    if (verbose):
        print("t = ", abs_t)

    t_alpha_percentile = t.ppf(1 - alpha / 2, df)

    if (verbose):
        print("abs_t = ", abs_t)
        print("t_alpha_percentile = ", t_alpha_percentile)

    if (abs_t >= t_alpha_percentile):
        print "REJECT the null hypothesis"
    else:
        print "ACCEPT the null hypothesis"

    ########
    # Plot #
    ########
    rv = t(df)
    limit = np.minimum(rv.dist.b, 5)
    x = np.linspace(-1 * limit, limit)
    h = plt.plot(x, rv.pdf(x))
    plt.xlabel('x')
    plt.ylabel('t(x)')
    plt.title('Difference significance test')
    plt.grid(True)
    plt.axvline(x = t_alpha_percentile, ymin = 0, ymax = 0.095, 
            linewidth=2, color='r')
    plt.axvline(x = abs_t, ymin = 0, ymax = 0.6, 
            linewidth=2, color='g')

    plt.annotate(r'(1 - $\alpha$ / 2) percentile', xy = (t_alpha_percentile, 0.05),
            xytext=(t_alpha_percentile + 0.5, 0.09), arrowprops=dict(facecolor = 'black', shrink = 0.05),)
    plt.annotate('t value', xy = (abs_t, 0.26),
            xytext=(abs_t + 0.5, 0.30), arrowprops=dict(facecolor = 'black', shrink = 0.05),)

    leg = plt.legend(('Student\'s t distribution', r'(1 - $\alpha$ / 2) percentile', 't value'), 
            'upper left', shadow = True)
    frame = leg.get_frame()
    frame.set_facecolor('0.80')
    for i in leg.get_texts():
        i.set_fontsize('small')

    for l in leg.get_lines():
        l.set_linewidth(1.5)

    normalized_sample = [0] * sample_len 
    for i in range(0, sample_len):
        normalized_sample[i] = (sample[i] - sample_mean) / (sample_std / math.sqrt(sample_len))
    plt.plot(normalized_sample, [0] * len(normalized_sample), 'ro')
    plt.show()

if __name__ == "__main__":
    main()

注：因為matplotlib顯示時座標比例老是不對，所以畫圖的時候把座標寫死了。

把上面的指令碼存為main.py，再如資料檔案為data.txt（每行表示一組樣本）:

0.26 0.30 0.40 0.49 0.43 0.84 1.18 0.25 0.67
0.18 0.33 0.57 0.32 0.58 0.29 0.98 0.17 0.79

執行：

$ ./main.py data.txt

得到

這裡假設null hypothesis為兩組資料無顯著性差異，alternative hypothesis為有顯著性差異。如圖，t value落在了拒絕域外，因此我們拒絕alternative hypothesis，接受null hypothesis，即兩組資料無顯著性差異。

[Python] 如何證明兩組樣本有顯著性差異(t-test假設檢驗)

[Python] 如何證明兩組樣本有顯著性差異(t-test假設檢驗)

如何判斷組之間是否有顯著性差異？

Python 比較兩個數組的元素的異同

r語言中如何進行兩組獨立樣本秩和檢驗

r語言中如何進行兩組獨立樣本秩和檢驗2

python 判斷兩個列表是否有公共元素

兩組raid5兩塊盤掉線資料恢復成功案例-有方案

兩組raid5兩塊盤掉線數據恢復成功案例-有方案

兩個遞增有序數組合併成一個遞增陣列

用python 合併兩個已經排好序的陣列並且新陣列不能有重複的元素

Python將兩個數組合併成一個數組，多維陣列變成一維陣列

python 中兩個list是否有相同元素的辦法？

python將兩個數組合並成一個數組的兩種方法的代碼

python遍歷數組的兩種方法的代碼

python 求兩個有序數組合並後的中位數

android 加密手機完畢後待機兩分鐘出現有頻率的雜音

【Python】數組排序

python 字典,元組,對象,數組取值方法

python 比較兩文件夾的內容,具有通用性。

python基礎-元組

[Python] 如何證明兩組樣本有顯著性差異(t-test假設檢驗)

相關推薦