Wilson Confidence Interval
阿新 • Published: 2020-07-21
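Because the normal-approximation interval is unreliable for small samples, the American mathematician Edwin Bidwell Wilson proposed a corrected formula in 1927. Known as the "Wilson interval", it handles the small-sample accuracy problem well.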
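To make the small-sample problem concrete, here is a minimal sketch comparing the two intervals on a single positive vote out of one. The function names and the choice z = 1.96 (roughly a 95% confidence level) are assumptions made for illustration, not part of the original post.

```python
import math


def normal_interval(pos, total, z=1.96):
    """Normal-approximation interval for a proportion (illustrative helper)."""
    p = pos / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p - half, p + half


def wilson_interval(pos, total, z=1.96):
    """Wilson score interval for a proportion (illustrative helper)."""
    p = pos / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# A hypothetical tiny sample: 1 positive vote out of 1.
print(normal_interval(1, 1))  # collapses to a single point at 1.0
print(wilson_interval(1, 1))  # stays wide, roughly from 0.21 up to 1
```

With one vote the normal interval has zero width, which claims far more certainty than the data supports, while the Wilson interval remains wide.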
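For a single 0/1 outcome X that equals 1 with probability p, the definitions of the mean and variance of a discrete random variable give:
$\mu = E(X) = 0 \cdot (1-p) + 1 \cdot p = p$
$\sigma^2 = D(X) = (0-E(X))^2 (1-p) + (1-E(X))^2 p = p^2(1-p) + (1-p)^2 p = p^2 - p^3 + p^3 - 2p^2 + p = p - p^2 = p(1-p)$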
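The derivation above can be checked numerically by summing directly over the two outcomes. This is only a sanity-check sketch; the helper name `bernoulli_moments` and the value p = 0.3 are made up for illustration.

```python
def bernoulli_moments(p):
    """Mean and variance of a 0/1 variable, straight from the definitions."""
    outcomes = [(0, 1 - p), (1, p)]  # (value, probability) pairs
    mean = sum(x * prob for x, prob in outcomes)
    var = sum((x - mean) ** 2 * prob for x, prob in outcomes)
    return mean, var


p = 0.3  # arbitrary example probability
mean, var = bernoulli_moments(p)
print(mean, p)            # the mean matches p
print(var, p * (1 - p))   # the variance matches p(1 - p), up to rounding
```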
The Wilson interval formula above can therefore be abbreviated as:
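For reference, the standard form of the Wilson score interval, as given on the Wikipedia page cited in the code's docstring, is shown here. The symbols are assumptions about notation: $\hat{p}$ is the observed proportion of positives, $n$ the sample size, and $z$ the normal quantile. The $\hat{p}(1-\hat{p})$ term is the variance $p(1-p)$ derived above, evaluated at the observed proportion.

```latex
\[
\frac{\hat{p} + \dfrac{z^2}{2n} \pm z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}
     {1 + \dfrac{z^2}{n}}
\]
```

Taking the minus sign gives the lower bound, which is the quantity usually used as a ranking score.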
Code:
import numpy as np


def wilson_score(pos, total, p_z=2.):
    """
    Wilson score function (lower bound of the Wilson interval).
    Reference: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
    :param pos: number of positive examples
    :param total: total number of examples
    :param p_z: quantile of the normal distribution
    :return: Wilson score
    """
    pos_rat = pos * 1. / total  # proportion of positive examples
    score = (pos_rat + (np.square(p_z) / (2. * total))
             - ((p_z / (2. * total)) * np.sqrt(4. * total * (1. - pos_rat) * pos_rat + np.square(p_z)))) / \
            (1. + np.square(p_z) / total)
    return score
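As a usage sketch, the same lower-bound formula can rank items with very different vote counts. This version uses only the standard library so it runs without NumPy. The item names and vote counts are invented, and returning 0.0 for an item with no votes is an assumption that mirrors the `positive + negative > 0` filter and the `IFERROR(..., 0)` guard in the SQL and Excel versions.

```python
import math


def wilson_lower_bound(pos, total, z=2.0):
    """Lower bound of the Wilson interval; 0.0 when there are no votes (assumption)."""
    if total == 0:
        return 0.0
    p = pos / total
    return (p + z * z / (2 * total)
            - (z / (2 * total)) * math.sqrt(4 * total * p * (1 - p) + z * z)) / (1 + z * z / total)


# Hypothetical items: name -> (positive votes, total votes)
items = {"A": (2, 2), "B": (90, 100), "C": (5, 10)}

ranked = sorted(items, key=lambda name: wilson_lower_bound(*items[name]), reverse=True)
for name in ranked:
    pos, total = items[name]
    print(name, pos, total, round(wilson_lower_bound(pos, total), 4))
```

Item B (90 of 100) ranks above item A (2 of 2) even though A has a perfect raw ratio, because two votes carry little evidence.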
SQL implementation:
# wilson_score
SELECT widget_id,
       ((positive + 1.9208) / (positive + negative)
        - 1.96 * SQRT((positive * negative) / (positive + negative) + 0.9604)
          / (positive + negative))
       / (1 + 3.8416 / (positive + negative)) AS ci_lower_bound
FROM widgets
WHERE positive + negative > 0
ORDER BY ci_lower_bound DESC;

# net_positive_ratings
SELECT widget_id, (positive - negative) AS net_positive_ratings
FROM widgets
ORDER BY net_positive_ratings DESC;

# average_rating
SELECT widget_id, positive / (positive + negative) AS average_rating
FROM widgets
ORDER BY average_rating DESC;
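The constants hard-coded in the SQL query are the z = 1.96 case written out: 3.8416 is z², 1.9208 is z²/2, and 0.9604 is z²/4. The sketch below checks this and confirms that the SQL-style expression agrees with the general formula. Both function names are hypothetical, and the vote counts are made up.

```python
import math

Z = 1.96  # normal quantile assumed by the hard-coded SQL constants


def sql_style_lower_bound(positive, negative):
    """Same arithmetic as the SQL query, with its literal constants."""
    n = positive + negative
    return ((positive + 1.9208) / n
            - 1.96 * math.sqrt((positive * negative) / n + 0.9604) / n) / (1 + 3.8416 / n)


def general_lower_bound(positive, negative, z=Z):
    """Wilson lower bound written in terms of p, n and z."""
    n = positive + negative
    p = positive / n
    return (p + z * z / (2 * n)
            - z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))) / (1 + z * z / n)


# Each comparison should print True
print(math.isclose(Z ** 2, 3.8416))
print(math.isclose(Z ** 2 / 2, 1.9208))
print(math.isclose(Z ** 2 / 4, 0.9604))
print(math.isclose(sql_style_lower_bound(90, 10), general_lower_bound(90, 10)))
```

To use a different confidence level, all three constants have to change together, which is why a parameterised version such as `p_z` in the Python function is easier to maintain.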
Excel implementation:
=IFERROR((([@[Up Votes]] + 1.9208) / ([@[Up Votes]] + [@[Down Votes]]) - 1.96 * SQRT(([@[Up Votes]] * [@[Down Votes]]) / ([@[Up Votes]] + [@[Down Votes]]) + 0.9604) / ([@[Up Votes]] + [@[Down Votes]])) / (1 + 3.8416 / ([@[Up Votes]] + [@[Down Votes]])),0)
Star rating ranking
References: