The Chi-Square Test in Python: Principles and Applications

阿新 • Published: 2019-01-12

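The chi-square test is also written as the χ² test.

The independence hypothesis:
Suppose we have a collection of news items or comments and want to know whether containing a particular word (say, the slang term 6得很) is related to the item's sentiment class (say, positive). A simple count gives us a four-cell (2×2) contingency table like this: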

Group                     Positive    Not positive    Total
Does not contain 6得很    19          24              43
Contains 6得很            34          10              44
Total                     53          34              87

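The first thing this table tells us is that the share of positive items does differ depending on whether an item contains a word such as 6得很: items containing 6得很 are positive more often (34/44 ≈ 77% versus 19/43 ≈ 44%). But we cannot yet rule out that this difference is due to sampling error. So we start by assuming that containing 6得很 and being positive are independent. Under that assumption, the probability that a randomly drawn news headline is positive is: (19 + 34) / (19 + 34 + 24 + 10) = 60.9%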

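The table of expected values:
The second step is to use the independence hypothesis to build a new four-cell table of expected (theoretical) values:

Group                     Positive              Not positive          Total
Does not contain 6得很    43 * 0.609 = 26.2     43 * 0.391 = 16.8     43
Contains 6得很            44 * 0.609 = 26.8     44 * 0.391 = 17.2     44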

If the two variables really are independent, the expected values and the observed values in the table should differ only slightly.
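The expected table above can be reproduced in a few lines. This is a minimal sketch, assuming Python 3 (for true division); the names `observed` and `expected_counts` are illustrative and do not come from the original code.

```python
# Observed counts from the first table.
# Rows: does not contain / contains the word; columns: positive / not positive.
observed = [[19, 24],
            [34, 10]]


def expected_counts(table):
    """Expected counts under independence: row_total * column_total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    print([round(value, 1) for value in row])
```

Multiplying a row total by a column share, as the table does, is the same computation as row total * column total / n.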

Computing the χ² value

$$\chi^2 = \sum \frac{(A - T)^2}{T}$$

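Here A is an observed value, one of the four counts in the first table, and T is an expected value, one of the four counts in the table of expected values.

χ² measures how far the observed values deviate from the expected values, which is the core idea of the chi-square test. It captures two things:

The absolute size of the deviation between observed and expected values (squaring amplifies large deviations)
The size of that deviation relative to the expected value

For the scenario above, the χ² value works out to 10.01.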
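As a check on that number, here is a short sketch of the sum of (A - T)² / T over the four cells, assuming Python 3. The function name `chi_square` is illustrative, and the expected values are the rounded ones from the table of expected values.

```python
def chi_square(observed, expected):
    """Sum (A - T)^2 / T over every cell of the two tables."""
    return sum((a - t) ** 2 / t
               for obs_row, exp_row in zip(observed, expected)
               for a, t in zip(obs_row, exp_row))


observed = [[19, 24], [34, 10]]
expected = [[26.2, 16.8], [26.8, 17.2]]  # rounded expected values

chi2_value = chi_square(observed, expected)
print(round(chi2_value, 2))  # 10.01
```

With unrounded expected counts the statistic comes out at about 10.00, so the small difference is only a rounding effect.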

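Critical values of the chi-square distribution
Now that we have the χ² value, how do we judge whether it is plausible? In other words, how do we know whether the independence hypothesis holds up? The answer is to look it up in a table of critical values of the chi-square distribution.
This requires the notion of degrees of freedom: V = (number of rows - 1) * (number of columns - 1). For a four-cell table, V = 1.
For V = 1, the critical values of the chi-square distribution are:

[Figure: table of chi-square critical values for V = 1]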

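Clearly 10.01 > 7.88, the critical value for a probability of 0.5%. In other words, if containing 6得很 and being positive were unrelated, a χ² value this large would turn up less than 0.5% of the time. We therefore reject the independence hypothesis at the 99.5% confidence level and conclude that the two are related.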
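Instead of a lookup table, the tail probability for V = 1 can be computed directly. The sketch below assumes Python 3 and uses only the standard library. It relies on the fact that a chi-square variable with one degree of freedom is the square of a standard normal variable, so P(χ² ≥ x) = erfc(sqrt(x / 2)). The function name `chi2_sf_df1` is illustrative, and the formula holds only for V = 1.

```python
import math


def chi2_sf_df1(x):
    """P(chi-square >= x) for 1 degree of freedom (valid for V = 1 only)."""
    return math.erfc(math.sqrt(x / 2.0))


p_at_critical = chi2_sf_df1(7.88)   # close to 0.005, the 0.5% level
p_value = chi2_sf_df1(10.01)        # tail probability of the observed statistic

print(p_at_critical)
print(p_value)
print(p_value < 0.005)  # True: reject independence at the 0.5% level
```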

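Application scenarios
A typical use of the chi-square test is to check whether a distribution observed under some condition matches a reference distribution. For example, to see whether a particular user's distribution on some metric differs sharply from that of the overall user base, the critical probabilities give a principled threshold for flagging anomalous users.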
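A minimal sketch of that kind of check, assuming Python 3. All of the data here is hypothetical: three activity categories, site-wide shares, and two made-up users. With three categories there are 2 degrees of freedom, and for that case the tail probability is exactly exp(-χ² / 2), so no lookup table is needed.

```python
import math


def goodness_of_fit(user_counts, overall_shares):
    """Chi-square statistic of observed counts against counts implied by reference shares."""
    n = sum(user_counts)
    expected = [n * share for share in overall_shares]
    return sum((a - t) ** 2 / t for a, t in zip(user_counts, expected))


# Hypothetical data: share of actions per category across all users,
# and the raw action counts of two individual users.
overall_shares = [0.5, 0.3, 0.2]
typical_user = [52, 29, 19]
unusual_user = [10, 20, 70]

for name, counts in [("typical", typical_user), ("unusual", unusual_user)]:
    stat = goodness_of_fit(counts, overall_shares)
    p = math.exp(-stat / 2.0)  # exact tail probability for 2 degrees of freedom
    print(name, round(stat, 2), "anomalous" if p < 0.005 else "normal")
```

The 0.5% threshold is carried over from the example above; a real system would pick its own level and would need enough observations per category for the approximation to be reasonable.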

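In addition, the χ² value reflects how strongly an independent variable is associated with the dependent variable: for tables of the same shape and sample size, the larger the χ² value, the stronger the association. It is therefore natural to use χ² for dimensionality reduction, keeping only the most strongly associated variables. Returning to news sentiment classification: suppose we want the 100 words most strongly associated with the positive class, so that later we can decide whether an item is positive from whether it contains those words. How? Simply compute the χ² value for every word that appears in the positive class following the steps above, sort by χ², and keep the 100 words with the largest values.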
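The ranking procedure just described can be sketched with the standard library alone, assuming Python 3. The toy corpus, the function names `chi_square_2x2` and `top_words`, and the choice of document-level counts are all assumptions made for illustration; the article's own code below works on token counts through NLTK instead.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Chi-square of the table [[a, b], [c, d]]: n(ad - bc)^2 / product of margins."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def top_words(docs, labels, target, k):
    """Rank words by chi-square between 'document contains word' and 'label == target'."""
    n_docs = len(docs)
    n_target = sum(1 for label in labels if label == target)
    in_target, in_other = Counter(), Counter()
    for words, label in zip(docs, labels):
        (in_target if label == target else in_other).update(set(words))
    scores = {}
    for word in set(in_target) | set(in_other):
        a = in_target[word]              # target documents containing the word
        b = in_other[word]               # other documents containing the word
        c = n_target - a                 # target documents without the word
        d = (n_docs - n_target) - b      # other documents without the word
        scores[word] = chi_square_2x2(a, b, c, d)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy corpus: each document is a list of already segmented words.
docs = [["great", "movie"], ["great", "plot"], ["great", "cast"],
        ["boring", "movie"], ["boring", "plot"], ["dull", "cast"]]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

print(top_words(docs, labels, "pos", 2))  # ['great', 'boring']
```

Note that χ² measures the strength of association, not its direction: "boring" ranks high because it is strongly tied to the absence of the positive label. If only positively associated words are wanted, an extra check such as a / (a + b) > n_target / n_docs would be needed.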

#! /usr/bin/env python2.7
#coding=utf-8

"""
Use positive and negative review sets as a corpus to train a sentiment classifier.
This module uses labeled positive and negative reviews as the training set, then uses the NLTK scikit-learn API to do the classification task.
The aim is to train a classifier that automatically identifies a review's positive or negative sentiment, and to use the probability as a review helpfulness feature.

"""

from Preprocessing_module import textprocessing as tp
import pickle
import itertools
from random import shuffle

import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist

import sklearn
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.metrics import accuracy_score


# 1. Load positive, negative and neutral ("zhong") review data
pos_review = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\pos.xlsx", 1, 1)
neg = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\neg.xlsx", 1, 1)
zhong = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\zhong.xlsx", 1, 1)
pos = pos_review

"""
# Trim the positive reviews to match the number of negative reviews (optional)

shuffle(pos_review)
size = int(len(pos_review)/2 - 18)

pos = pos_review[:size]
neg = neg

"""


# 2. Feature extraction function
# 2.1 Use all words as features
def bag_of_words(words):
    return dict([(word, True) for word in words])


# 2.2 Use bigrams as features (use chi-square to choose the top 200 bigrams)
def bigrams(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return bag_of_words(bigrams)


# 2.3 Use words and bigrams as features (use chi-square to choose the top 200 bigrams)
def bigram_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return bag_of_words(words + bigrams)


# 2.4 Use chi_sq to find most informative features of the review
# 2.4.1 First we should compute words or bigrams information score
def create_word_scores():
    posdata = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\pos.xlsx", 1, 1)
    negdata = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\neg.xlsx", 1, 1)
    # Neutral reviews get their own variable so that negdata keeps the negative set
    zhongdata = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\zhong.xlsx", 1, 1)

    posWords = list(itertools.chain(*posdata))
    negWords = list(itertools.chain(*negdata))
    zhongWords = list(itertools.chain(*zhongdata))

    word_fd = FreqDist()
    cond_word_fd = ConditionalFreqDist()
    for word in posWords:
        word_fd.inc(word)
        cond_word_fd['pos'].inc(word)
    for word in negWords:
        word_fd.inc(word)
        cond_word_fd['neg'].inc(word)
    for word in zhongWords:
        word_fd.inc(word)
        cond_word_fd['zhong'].inc(word)
    pos_word_count = cond_word_fd['pos'].N()
    neg_word_count = cond_word_fd['neg'].N()
    zhong_word_count = cond_word_fd['zhong'].N()
    #print zhong_word_count
    total_word_count = pos_word_count + neg_word_count + zhong_word_count

    word_scores = {}
    for word, freq in word_fd.iteritems():
        pos_score = BigramAssocMeasures.chi_sq(cond_word_fd['pos'][word], (freq, pos_word_count), total_word_count)
        neg_score = BigramAssocMeasures.chi_sq(cond_word_fd['neg'][word], (freq, neg_word_count), total_word_count)
        zhong_score = BigramAssocMeasures.chi_sq(cond_word_fd['zhong'][word], (freq, zhong_word_count), total_word_count)
        word_scores[word] = pos_score + neg_score +zhong_score

    return word_scores

def create_bigram_scores():
    posdata = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\pos.xlsx", 1, 1)
    negdata = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\neg.xlsx", 1, 1)
    zhongdata = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\zhong.xlsx", 1, 1)

    posWords = list(itertools.chain(*posdata))
    negWords = list(itertools.chain(*negdata))
    zhongWords = list(itertools.chain(*zhongdata))

    # One finder per class, so each nbest() call scores bigrams from its own corpus
    pos_finder = BigramCollocationFinder.from_words(posWords)
    neg_finder = BigramCollocationFinder.from_words(negWords)
    zhong_finder = BigramCollocationFinder.from_words(zhongWords)
    posBigrams = pos_finder.nbest(BigramAssocMeasures.chi_sq, 8000)
    negBigrams = neg_finder.nbest(BigramAssocMeasures.chi_sq, 8000)
    zhongBigrams = zhong_finder.nbest(BigramAssocMeasures.chi_sq, 8000)
    pos = posBigrams
    neg = negBigrams
    zhong = zhongBigrams
    word_fd = FreqDist()
    cond_word_fd = ConditionalFreqDist()
    for word in pos:
        word_fd.inc(word)
        cond_word_fd['pos'].inc(word)
    for word in neg:
        word_fd.inc(word)
        cond_word_fd['neg'].inc(word)
    for word in zhong:
        word_fd.inc(word)
        cond_word_fd['zhong'].inc(word)
    pos_word_count = cond_word_fd['pos'].N()
    neg_word_count = cond_word_fd['neg'].N()
    zhong_word_count = cond_word_fd['zhong'].N()
    total_word_count = pos_word_count + neg_word_count + zhong_word_count

    word_scores = {}
    for word, freq in word_fd.iteritems():
        pos_score = BigramAssocMeasures.chi_sq(cond_word_fd['pos'][word], (freq, pos_word_count), total_word_count)
        neg_score = BigramAssocMeasures.chi_sq(cond_word_fd['neg'][word], (freq, neg_word_count), total_word_count)
        zhong_score = BigramAssocMeasures.chi_sq(cond_word_fd['zhong'][word], (freq, zhong_word_count), total_word_count)
        word_scores[word] = pos_score + neg_score + zhong_score

    return word_scores

# Combine words and bigrams and compute words and bigrams information scores
def create_word_bigram_scores():
    posdata = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\pos.xlsx", 1, 1)
    negdata = tp.seg_fil_senti_excel(r"D:\tomcat\review_protection\Feature_extraction_module\Sentiment_features\Machine learning features\seniment review set\neg.xlsx", 1, 1)

    posWords = list(itertools.chain(*posdata))
    negWords = list(itertools.chain(*negdata))

    # One finder per class, so each nbest() call scores bigrams from its own corpus
    pos_finder = BigramCollocationFinder.from_words(posWords)
    neg_finder = BigramCollocationFinder.from_words(negWords)
    posBigrams = pos_finder.nbest(BigramAssocMeasures.chi_sq, 5000)
    negBigrams = neg_finder.nbest(BigramAssocMeasures.chi_sq, 5000)

    pos = posWords + posBigrams
    neg = negWords + negBigrams

    word_fd = FreqDist()
    cond_word_fd = ConditionalFreqDist()
    for word in pos:
        word_fd.inc(word)
        cond_word_fd['pos'].inc(word)
    for word in neg:
        word_fd.inc(word)
        cond_word_fd['neg'].inc(word)

    pos_word_count = cond_word_fd['pos'].N()
    neg_word_count = cond_word_fd['neg'].N()
    total_word_count = pos_word_count + neg_word_count

    word_scores = {}
    for word, freq in word_fd.iteritems():
        pos_score = BigramAssocMeasures.chi_sq(cond_word_fd['pos'][word], (freq, pos_word_count), total_word_count)
        neg_score = BigramAssocMeasures.chi_sq(cond_word_fd['neg'][word], (freq, neg_word_count), total_word_count)
        word_scores[word] = pos_score + neg_score

    return word_scores

# Choose the word_scores extraction method
word_scores = create_word_scores()
#word_scores = create_bigram_scores()
# word_scores = create_word_bigram_scores()


# 2.4.2 Second, extract the most informative words or bigrams based on the information score
def find_best_words(word_scores, number):
    best_vals = sorted(word_scores.iteritems(), key=lambda (w, s): s, reverse=True)[:number]
    best_words = set([w for w, s in best_vals])
    return best_words

# 2.4.3 Third we could use the most informative words and bigrams as machine learning features
# Use chi_sq to find most informative words of the review
def best_word_features(words):
    return dict([(word, True) for word in words if word in best_words])

# Use chi_sq to find most informative bigrams of the review
def best_word_features_bi(words):
    return dict([(word, True) for word in nltk.bigrams(words) if word in best_words])

# Use chi_sq to find most informative words and bigrams of the review
def best_word_features_com(words):
    d1 = dict([(word, True) for word in words if word in best_words])
    d2 = dict([(word, True) for word in nltk.bigrams(words) if word in best_words])
    d3 = dict(d1, **d2)
    return d3



# 3. Transform review to features by setting labels to words in review
def pos_features(feature_extraction_method):
    posFeatures = []
    #print "pos"
    for i in pos:
        #for key in feature_extraction_method(i):
            #print key
        posWords = [feature_extraction_method(i),'pos']
        posFeatures.append(posWords)
    return posFeatures

def neg_features(feature_extraction_method):
    negFeatures = []
    #print "neg"
    for j in neg:
        #for key in feature_extraction_method(j):
          # print key
        negWords = [feature_extraction_method(j),'neg']
        negFeatures.append(negWords)
    return negFeatures

def zhong_Features(feature_extraction_method):
    zhongFeatures = []
    print "zhong"
    for j in zhong:
        for key in feature_extraction_method(j):
            print key
        zhongWords = [feature_extraction_method(j),'zhong']
        zhongFeatures.append(zhongWords)
    return zhongFeatures
best_words = find_best_words(word_scores, 1000) # Set the dimension and initialize the most informative words

posFeatures = pos_features(best_word_features_com)
negFeatures = neg_features(best_word_features_com)
zhongFeatures = zhong_Features(best_word_features_com)
# posFeatures = pos_features(bigram_words)
# negFeatures = neg_features(bigram_words)

#posFeatures = pos_features(best_word_features)
#print type(posFeatures)

#negFeatures = neg_features(best_word_features)
#zhongFeatures = zhong_Features(best_word_features)
# posFeatures = pos_features(best_word_features_com)
# negFeatures = neg_features(best_word_features_com)

The results are as follows:

Neutral sentiment words:
普及
成為
家庭
資訊
成
事件
謀求
事情
產生
輿論
引發
發生
當而
方
場
人們
相信
世界
翻轉
反
公眾
羅爾
網路時代
規則
對稱
應該
面對
更要
對立
不要
媒體
微信
一輪
網路
號
換位
陣營
網友
都
技能
不明智
新聞
很
激烈
不
蘊藏
聰明反被聰明誤
相關
社交
大
刷屏
二小
學會
感到
越來越
人
清晰
中關村
破壞
微博
影響
彙集
屢有
不同
真實性
Positive sentiment words:

榮譽證書
去年
第二
名叫
年前
寫
會
節省
患
同意
收下
申請
爬
李慶國
當下
無疑
搶救
8
買藥
但確
湊
健康
難以
醫藥費
好心人
家裡
感謝
19
仍然
年
一年
廣懷
呼吸困難
公里
二
江蘇
學校
母親
長大
小時
狀態
看起來
常態
成功
暈倒
爬樓梯
3
出發
一拖再拖
無法
患有
想法
期限
趕到
減輕
出去
出現
火車
最早
父母



Negative sentiment words:
改革
往往
率
取消
月
說法
中止執行
會
搞
存在
事情
案件
工作
執行
資料
司法
視訊
後續
迴應
刑事拘留
指標
不合理
法
不
應該
法院
會議
日
這次
澎湃
之前
數
不算
考核
卻
告別
堅決
其實
屢屢
提出
必要
任務
手段
結案率
進行
執法
法官
撤案
審理
新
罰款
法律
不停
年
年底
專案
不是
杜絕
承德縣
