Regression Model Evaluation Series 1: QQ Plots

阿新 • Published: 2018-03-01


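Introduction: QQ plot is short for Quantile-Quantile plot. Put simply, it takes the values of two distributions at the same quantiles and plots them as points (x, y). If the two distributions are close, the points (x, y) lie near the line y = x; otherwise they do not. A QQ plot can therefore be used to assess the overall predictive quality of a regression model.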
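As a rough illustration of the idea above, the sketch below pairs matching quantiles of two samples and measures how far the resulting points stray from y = x. The sample names, sizes, seed, and probability grid are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical samples: a and b share a distribution, c is shifted and wider
a = rng.normal(loc=0.0, scale=1.0, size=2000)
b = rng.normal(loc=0.0, scale=1.0, size=2000)
c = rng.normal(loc=1.0, scale=2.0, size=2000)

# Probabilities at which the quantiles of the samples are compared
probs = np.linspace(0.05, 0.95, 19)

qa = np.quantile(a, probs)
qb = np.quantile(b, probs)
qc = np.quantile(c, probs)

# Each pair (qa[i], qb[i]) is one point of the QQ plot of a against b
max_gap_same = np.max(np.abs(qb - qa))  # small: the points stay close to y = x
max_gap_diff = np.max(np.abs(qc - qa))  # large: the points move away from y = x

print(max_gap_same, max_gap_diff)
```

Plotting qa against qb (or qc) with matplotlib would give the QQ plot itself; the two gaps are just a numeric summary of how close the points are to the diagonal.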

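There are generally two kinds of QQ plot: the normal QQ plot and the general QQ plot. The difference is that in a normal QQ plot one of the two distributions is the normal distribution. Let's look at each kind in turn.

Normal QQ plot

The figure below comes from here and uses Filliben's estimate to determine the n quantile points.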

[Figure: normal QQ plot with quantile points from Filliben's estimate]
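For reference, Filliben's estimate of the uniform order statistic medians for a sample of size n is commonly written as below. The symbols m_i, z_i and x_(i) are notation chosen here, not taken from the original figure; the formula matches what the function calc_uniform_order_statistic_medians in this article computes.

```latex
m_i =
\begin{cases}
1 - 0.5^{1/n}, & i = 1 \\[4pt]
\dfrac{i - 0.3175}{n + 0.365}, & i = 2, \dots, n - 1 \\[8pt]
0.5^{1/n}, & i = n
\end{cases}
\qquad
z_i = \Phi^{-1}(m_i)
```

Here \Phi^{-1} is the inverse CDF of the standard normal distribution. The normal QQ plot then consists of the points (z_i, x_{(i)}), where x_{(i)} is the i-th smallest sample value.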

Now let's try drawing a normal QQ plot.

Using the function built into an open-source library is simple, but it hides some of the details.

import numpy as np
import matplotlib
from matplotlib import pyplot as plt
from scipy.stats import probplot

matplotlib.style.use('ggplot')

# Draw 100 random values from a standard normal distribution
x = np.round(np.random.normal(loc=0.0, scale=1.0, size=100), 2)

f = plt.figure(figsize=(8, 6))
ax = f.add_subplot(111)
# probplot draws the normal QQ plot of x on the given axes
probplot(x, plot=ax)
plt.show()

Next we unpack some of the details, which lays the groundwork for the general QQ plot below.

import numpy as np
from scipy.stats import norm, linregress
from matplotlib import pyplot as plt

# Return the order statistic medians (Filliben's estimate), one per element of x
def calc_uniform_order_statistic_medians(x):
    N = len(x)
    osm_uniform = np.zeros(N, dtype=np.float64)
    osm_uniform[-1] = 0.5**(1.0 / N)
    osm_uniform[0] = 1 - osm_uniform[-1]
    i = np.arange(2, N)
    osm_uniform[1:-1] = (i - 0.3175) / (N + 0.365)
    return osm_uniform

# Draw 100 random values from a standard normal distribution
x = np.round(np.random.normal(loc=0.0, scale=1.0, size=100), 2)
osm_uniform = calc_uniform_order_statistic_medians(x)
# ppf (percent point function) is the inverse of the cdf (cumulative distribution function):
# it returns the value located at the given quantile
osm = norm.ppf(osm_uniform)
osr = np.sort(x)
# Linear regression of osr on osm: slope, intercept and related statistics
slope, intercept, rvalue, pvalue, stderr = linregress(osm, osr)

plt.figure(figsize=(10, 8))
plt.plot(osm, osr, 'bo', label='ordered sample')
plt.plot(osm, slope * osm + intercept, 'r-', label='least-squares fit')
plt.legend()
plt.show()

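The left plot uses 100 samples and the right plot uses 1000 samples. Comparing them, the points of the 1000-sample plot lie closer to the straight line y = x, which means they fit the normal distribution more closely.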

[Figure: normal QQ plots for 100 samples (left) and 1000 samples (right)]
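The comparison between sample sizes can also be checked numerically. The sketch below is one possible way to do it, using the fit that scipy.stats.probplot returns when no plot is requested; the helper name qq_fit and the seed are assumptions made for this example, and the exact numbers change from one random draw to the next.

```python
import numpy as np
from scipy.stats import probplot

rng = np.random.default_rng(42)

def qq_fit(n):
    """Return (slope, intercept, r) of the normal QQ fit for n standard normal draws."""
    sample = rng.normal(loc=0.0, scale=1.0, size=n)
    # Without a plot argument, probplot returns ((osm, osr), (slope, intercept, r))
    (osm, osr), (slope, intercept, r) = probplot(sample, dist="norm")
    return slope, intercept, r

fit_100 = qq_fit(100)
fit_1000 = qq_fit(1000)

print("n=100 : slope=%.3f intercept=%.3f r=%.4f" % fit_100)
print("n=1000: slope=%.3f intercept=%.3f r=%.4f" % fit_1000)
```

For standard normal data, a slope near 1, an intercept near 0, and r near 1 indicate that the points follow y = x; larger samples typically, though not on every draw, land closer to those values.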

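The general QQ plot differs from the normal QQ plot in that the reference is not the normal distribution but a data set with an arbitrary distribution, which is exactly what we need here.

The figure below comes from here.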

[Figure: general QQ plot]

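The figure below shows one scenario: the dashed line is the real network variation and the solid line is the result of a simple smoothing prediction. I want to use a general QQ plot to see how well the simple smoothing prediction fits.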

[Figure: real network curve (dashed) and smoothed prediction (solid)]

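First, look at the CDF plots of the two curves (F_X(x) = P(X ≤ x)).

In this plot the points at which the cumulative distribution is evaluated come from np.linspace(min(X), max(X), len(X)), and the result looks a little odd.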

[Figure: CDFs of the two curves evaluated on a linspace grid]

We recompute the CDF plot using the original data values as the evaluation points. Notice anything interesting?

[Figure: CDFs of the two curves evaluated at the original data points]
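The difference between the two ways of choosing evaluation points can be seen on a tiny example. This is a minimal NumPy-only sketch; the helper ecdf_at and the five made-up values are assumptions for illustration and are not the network data from this article.

```python
import numpy as np

def ecdf_at(sample, points):
    """Empirical CDF of `sample` at `points`: the fraction of values <= each point."""
    sorted_sample = np.sort(sample)
    # side="right" counts how many sample values are <= the query point
    return np.searchsorted(sorted_sample, points, side="right") / len(sorted_sample)

data = np.array([1.0, 1.1, 1.2, 1.3, 5.0])  # made-up values with one outlier

# Option 1: evenly spaced grid between min and max
grid = np.linspace(data.min(), data.max(), len(data))
cdf_on_grid = ecdf_at(data, grid)

# Option 2: evaluate at the observations themselves
points = np.sort(data)
cdf_on_data = ecdf_at(data, points)

print(grid, cdf_on_grid)    # levels jump unevenly: 0.2, 0.8, 0.8, 0.8, 1.0
print(points, cdf_on_data)  # levels rise in equal steps: 0.2, 0.4, 0.6, 0.8, 1.0
```

On the grid, several points can fall in a stretch with no data, so the curve has flat runs and sudden jumps. At the observations themselves, the i-th smallest value always sits at level (i + 1) / n when there are no ties.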

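When the two curves have the same number of points, sorting both data sets in ascending order gives the same CDF value at each matching position.

So when the two curves have the same number of points, drawing the QQ plot only requires sorting both in ascending order.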
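The sorting shortcut can be wrapped in a small helper. The sketch below is hypothetical: the function name qq_points is invented here, and the unequal-size branch, with plotting positions (i - 0.5) / n, is an added assumption that the article itself does not cover.

```python
import numpy as np

def qq_points(x, y):
    """Return paired quantiles (qx, qy) for a two-sample QQ plot."""
    x_sorted, y_sorted = np.sort(x), np.sort(y)
    if len(x_sorted) == len(y_sorted):
        # Equal sizes: the i-th smallest values of both samples share the CDF level (i + 1) / n
        return x_sorted, y_sorted
    # Unequal sizes (assumption): compare both samples at the levels of the smaller one
    n = min(len(x_sorted), len(y_sorted))
    probs = (np.arange(1, n + 1) - 0.5) / n
    return np.quantile(x_sorted, probs), np.quantile(y_sorted, probs)

# Equal sizes: the result is simply the two sorted arrays
a = np.array([3.0, 1.0, 2.0])
b = np.array([30.0, 10.0, 20.0])
qa, qb = qq_points(a, b)
print(qa, qb)

# Unequal sizes: both outputs have the length of the smaller sample
c = np.arange(10.0)
qa2, qc2 = qq_points(a, c)
print(qa2, qc2)
```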

[Figure: QQ plot of the real network curve against the smoothed prediction]

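As the plot shows, the QQ plot of the real network curve against the smoothed prediction has a slope of only 0.79, which indicates that the distribution of the smoothed prediction still differs considerably from that of the source data.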
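One plausible reading of a slope below 1 is that smoothing narrows the spread of the values. The sketch below illustrates this on synthetic independent noise, which is a strong simplification: the data, the seed, and the centered window of width 5 are all assumptions, and real network series are autocorrelated, so the effect there is usually milder.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)
raw = rng.normal(loc=2.0, scale=0.4, size=5000)  # synthetic "measurements"

# Centered moving average of width 5; mode="valid" drops the incomplete edge windows
kernel = np.ones(5) / 5
smoothed = np.convolve(raw, kernel, mode="valid")
raw_trimmed = raw[2:-2]  # same length as the "valid" output

# Equal sizes, so the QQ points are just the two sorted arrays
fit = linregress(np.sort(raw_trimmed), np.sort(smoothed))
std_ratio = np.std(smoothed) / np.std(raw_trimmed)

print(fit.slope, std_ratio)
```

When both samples have roughly the same shape, the QQ slope is close to the ratio of their standard deviations, so a slope well under 1 means the smoothed series varies less than the series it is compared against.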

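Finally, the code

import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import probplot

httpspeedavg = np.array([1821000, 2264000, 2209000, 2203000, 2306000, 2005000, 2428000,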
       2246000, 1642000,  721000, 1125000, 1335000, 1367000, 1760000,
       1807000, 1761000, 1767000, 1723000, 1883000, 1645000, 1548000,
       1608000, 1372000, 1532000, 1485000, 1527000, 1618000, 1640000,
       1199000, 1627000, 1620000, 1770000, 1741000, 1744000, 1986000,
       1931000, 2410000, 2293000, 2199000, 1982000, 2036000, 2462000,
       2246000, 2071000, 2220000, 2062000, 1741000, 1624000, 1872000,
       1621000, 1426000, 1723000, 1735000, 1443000, 1735000, 2053000,
       1811000, 1958000, 1828000, 1763000, 2185000, 2267000, 2134000,
       2253000, 1719000, 1669000, 1973000, 1615000, 1839000, 1957000,
       1809000, 1799000, 1706000, 1549000, 1546000, 1692000, 2335000,
       2611000, 1855000, 2092000, 2029000, 1695000, 1379000, 2400000,
       2522000, 2140000, 2614000, 2399000, 2376000])

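def smooth_(squences, period=5):
    # Moving average: each window covers indices [i - gap, i + gap), clipped to the array bounds
    res = []
    gap = period // 2  # integer division, so the result can be used as a slice index
    right = len(squences)
    for i in range(right):
        res.append(np.mean(squences[max(i - gap, 0):min(i + gap, right)]))
    return res

# Scale the raw values by 1/1024/1024 and round to two decimals
httpavg = np.round((1.0*httpspeedavg/1024/1024).tolist(), 2)
smooth = np.round(smooth_((1.0*httpspeedavg/1024/1024).tolist(), 5), 2)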

# Normal QQ plot of the smoothed series
f = plt.figure(figsize=(8, 6))
ax = f.add_subplot(111)
probplot(smooth, plot=ax)
# plt.show()

# Normal QQ plot of the raw series
f = plt.figure(figsize=(8, 6))
ax = f.add_subplot(111)
probplot(httpavg, plot=ax)
# plt.show()

import statsmodels.api as sm

# CDF of both series, evaluated on an evenly spaced grid between min and max
plt.figure(figsize=(15, 8))
ecdf = sm.distributions.ECDF(httpavg)
x = np.linspace(min(httpavg), max(httpavg), len(httpavg))
y = ecdf(x)
plt.plot(x, y, label='httpavg', color='blue', marker='.')
ecdf1 = sm.distributions.ECDF(smooth)
x1 = np.linspace(min(smooth), max(smooth), len(smooth))
y1 = ecdf1(x1)
plt.plot(x1, y1, label='smooth', color='red', marker='.')
plt.legend(loc='best')
# plt.show()
def cdf(l):
    # CDF levels for data already sorted in ascending order: the i-th value gets (i + 1) / n
    res = []
    length = len(l)
    for i in range(length):
        res.append(1.0*(i+1)/length)
    return res

# CDF of both series, evaluated at the original data values
plt.figure(figsize=(15, 8))
x = np.sort(httpavg)
y = cdf(x)
plt.plot(x, y, label='httpavg', color='blue', marker='.')
x1 = np.sort(smooth)
y1 = cdf(x1)
plt.plot(x1, y1, label='smooth', color='red', marker='.')
plt.legend(loc='best')
# plt.show()
from scipy.stats import linregress

# QQ plot: with equal sample sizes, plot one sorted series against the other
plt.figure(figsize=(10, 8))
httpavg = np.sort(httpavg)
smooth = np.sort(smooth)
slope, intercept, rvalue, pvalue, stderr = linregress(httpavg, smooth)
plt.plot(httpavg, smooth, 'bo', httpavg, slope*httpavg + intercept, 'r-')
xmin = np.amin(httpavg)
xmax = np.amax(httpavg)
ymin = np.amin(smooth)
ymax = np.amax(smooth)
posx = xmin + 0.50 * (xmax - xmin)
posy = ymin + 0.01 * (ymax - ymin)
# rvalue is the correlation coefficient r, so square it for the R^2 label
plt.text(posx, posy, "$R^2=%1.4f$ y = %.2f *x + %.2f" % (rvalue**2, slope, intercept))
plt.plot(httpavg, httpavg, color='green', label='y=x')
plt.legend(loc='best')
plt.show()
