[2] Machine Learning: Regression

阿新 • Published: 2018-11-11

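2.1 Linear Regression

2.1.1 Experimental Data

1. Data description

The data comes from the book An Introduction to Statistical Learning with Applications in R (Springer, 2013) by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. It contains 200 records with 4 attributes each.

Download URL: http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv

2. Dataset information

The data has 4 columns and 200 rows. Each row is one product; the first 3 columns are the input features and the last column is the output.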

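Input features:

TV: advertising spend for the product on television (in thousands of dollars; the same unit applies below)

radio: advertising spend on radio

newspaper: advertising spend in newspapers

Output:

sales: sales of the product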

3. Sample data

2.1.2 Experimental Procedure

Run the following in Python.

1. Collect and prepare the data

import pandas as pd
data = pd.read_csv("Advertising.csv")
data.head()  # show the first 5 rows

Check the size of the dataset:

data.shape

2. Analyze the data

import matplotlib.pyplot as plt
import pandas as pd
if __name__ == "__main__":
    path = "Advertising.csv"
    # read the data with pandas
    data = pd.read_csv(path)
    x = data[['TV','radio','newspaper']]
    y = data['sales']
    plt.figure(figsize=(9,12))
    plt.subplot(311)
    plt.plot(data['TV'],y,'ro')
    plt.title('TV')
    plt.grid()
    plt.subplot(312)
    plt.plot(data['radio'],y,'g^')
    plt.title('radio')
    plt.grid()
    plt.subplot(313)
    plt.plot(data['newspaper'],y,'b*')
    plt.title('newspaper')
    plt.grid()
    plt.tight_layout()
    plt.show()

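This produces three scatter plots of sales against TV, radio, and newspaper spending.

3. Build the feature matrix X and the label vector y with pandas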

feature_cols = ['TV', 'radio', 'newspaper']
X = data[feature_cols]
print(X.head())
print(type(X))
print(X.shape)
y = data['sales']
print(y.head())

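The output shows the first rows of X, its type and shape, and the first values of y.

4. Build the training and test sets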

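# sklearn.cross_validation was removed from scikit-learn;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# by default 75% of the rows go to the training set and 25% to the test set
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)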
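
To make the 75% / 25% split concrete, here is a minimal sketch of what a seeded shuffle-and-split does to 200 row indices. The helper name `split_indices` and its parameters are hypothetical, introduced only for illustration; this is not scikit-learn's implementation and it will not select the same rows as `train_test_split(..., random_state=1)`.

```python
import math
import random

def split_indices(n_rows, test_fraction=0.25, seed=1):
    """Shuffle row indices and split them into (train, test) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # reproducible shuffle for a fixed seed
    n_test = math.ceil(n_rows * test_fraction)  # size of the held-out part
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(200)
print(len(train_idx), len(test_idx))
```

For 200 rows this yields 150 training indices and 50 test indices, and the two lists never share a row, which is the property that makes the test error an honest estimate.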

5. Linear regression with sklearn

from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
model = linreg.fit(X_train, y_train)
print(model)
print(linreg.intercept_)
print(linreg.coef_)
# pair each feature name with its fitted coefficient
print(list(zip(feature_cols, linreg.coef_)))

This gives the fitted model: y = 2.8769 + 0.0465*TV + 0.1791*radio + 0.00345*newspaper
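
As a quick worked example, the fitted equation can be evaluated by hand for a single row. The sketch below copies the coefficients from the equation above; the budget values are hypothetical and chosen only for illustration, and `predict_sales` is an illustrative helper, not part of sklearn.

```python
# Coefficients copied from the fitted equation above.
INTERCEPT = 2.8769
COEFS = {"TV": 0.0465, "radio": 0.1791, "newspaper": 0.00345}

def predict_sales(budget):
    """Evaluate intercept + sum(coefficient * spend) for one row."""
    return INTERCEPT + sum(COEFS[name] * budget[name] for name in COEFS)

# Hypothetical advertising budget, for illustration only.
example = {"TV": 100.0, "radio": 20.0, "newspaper": 10.0}
print(round(predict_sales(example), 4))
```

The terms are 2.8769 + 4.65 + 3.582 + 0.0345, so the prediction is about 11.14. The newspaper term contributes almost nothing, which is the observation the later analysis builds on.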

6. Prediction

y_pred = linreg.predict(X_test)
print(y_pred)
print(type(y_pred))

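7. Evaluation metrics for regression

For classification problems the usual evaluation metric is accuracy, but accuracy does not apply to regression. For continuous outputs we use the following metrics instead:

(1) Mean Absolute Error (MAE)

(2) Mean Squared Error (MSE)

(3) Root Mean Squared Error (RMSE)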
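
For reference, the standard definitions of these three metrics can be written as follows. The symbols are introduced here and do not appear in the original text: y_i is the true value of test sample i, \hat{y}_i is the predicted value, and n is the number of test samples.

```latex
\mathrm{MAE}  = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
\qquad
\mathrm{MSE}  = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```

RMSE is expressed in the same unit as the target (here, sales), which makes it easier to interpret than MSE.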

Here we use RMSE:

print(type(y_pred), type(y_test))
print(len(y_pred), len(y_test))
print(y_pred.shape, y_test.shape)
from sklearn import metrics
import numpy as np
# accumulate the squared errors, then take the square root of their mean
sum_mean = 0
for i in range(len(y_pred)):
    sum_mean += (y_pred[i] - y_test.values[i]) ** 2

print("RMSE by hand:", np.sqrt(sum_mean / len(y_pred)))
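
The hand-written loop above is one instance of a general pattern. As a small self-contained sketch, all three metrics can be computed in plain Python; the toy values below are made up for illustration and are not taken from the Advertising data.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

# Toy values for illustration only; the errors are -1, 0 and 2.
y_true = [3.0, 5.0, 2.0]
y_pred = [2.0, 5.0, 4.0]
print(mae(y_true, y_pred), mse(y_true, y_pred), rmse(y_true, y_pred))
```

With errors of -1, 0 and 2, the MAE is 1, the MSE is 5/3 and the RMSE is its square root, roughly 1.29. The larger error dominates the squared metrics, which is why RMSE penalizes outliers more than MAE.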

8. Plotting

import matplotlib.pyplot as plt
plt.figure()
plt.plot(range(len(y_pred)), y_pred, 'b', label="predict")  # blue line: predicted values
plt.plot(range(len(y_pred)), y_test, 'r', label="test")  # red line: true values
plt.legend(loc="upper right")  # show the legend in the upper-right corner
plt.xlabel("the number of sales")
plt.ylabel("value of sales")
plt.show()

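2.1.3 Analysis of Results

The fitted model y = 2.8769 + 0.0465*TV + 0.1791*radio + 0.00345*newspaper shows that the coefficient of newspaper is very small. The sales vs. newspaper scatter plot also shows no clear linear relationship, so we can try dropping this feature and see how the regression predictions change.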

feature_cols = ['TV', 'radio']
X = data[feature_cols]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
model = linreg.fit(X_train, y_train)
print(list(zip(feature_cols, linreg.coef_)))
y_pred = linreg.predict(X_test)
sum_mean = 0
for i in range(len(y_pred)):
    sum_mean += (y_pred[i] - y_test.values[i]) ** 2

print("RMSE by hand:", np.sqrt(sum_mean / len(y_pred)))
plt.figure()
plt.plot(range(len(y_pred)), y_pred, 'b', label="predict")  # blue line: predicted values
plt.plot(range(len(y_pred)), y_test, 'r', label="test")  # red line: true values
plt.legend(loc="upper right")  # show the legend in the upper-right corner
plt.xlabel("the number of sales")
plt.ylabel("value of sales")
plt.show()

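The resulting RMSE is 1.387.

The plot of predicted versus true values is as follows:

After removing the newspaper feature the RMSE is smaller, which suggests that newspaper may not be a useful feature for predicting sales. This gives us a new model.

2.1.4 Notes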

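Although this model is simple, it covers a good part of the machine learning workflow, such as splitting the data into a 75% training set and a 25% test set, which is often the first step when exploring machine learning. The fitted linear model had one weight close to zero (newspaper), and we handled it in the simplest possible way, by deleting the feature; even so, the prediction result improved.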

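In machine learning, Occam's razor says: if a simple model can solve the problem, do not use a complex one, because complex models tend to add uncertainty, incur unnecessary cost, and overfit more easily.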

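2.2 Logistic Regression

2.2.1 Experimental Data

The Iris dataset is perhaps the best-known test dataset in pattern recognition. It contains 3 classes of iris with 50 samples each; one class is linearly separable from the other two, while the other two are not linearly separable from each other.

Because the original dataset contains two errors (samples 35 and 38), we use the corrected data in our experiments.

Download URL: http://archive.ics.uci.edu/ml/dataset/Iris

2.2.2 Experimental Procedure

(1) Data description

The dataset contains 150 rows, one sample per row. Each sample has 5 fields: sepal length (cm), sepal width (cm), petal length (cm), petal width (cm), and class (one of Iris Setosa, Iris Versicolor, Iris Virginica).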

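Dataset characteristics	Multivariate	Number of instances	150
Attribute characteristics	Real	Number of attributes	4
Associated tasks	Classification	Missing values	None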
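
To illustrate the record layout described above, here is a minimal sketch that parses one comma-separated record into four float features and an integer class label. It assumes the class name is the last field and uses the same name-to-integer mapping as the `iris_type` function in this article; `parse_iris_row` and the sample line are illustrative only.

```python
# Same class-name-to-integer mapping as iris_type in this article.
LABELS = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

def parse_iris_row(line):
    """Split one record into ([4 float features], integer class label)."""
    *values, name = line.strip().split(",")
    return [float(v) for v in values], LABELS[name]

# Sample record written in the assumed file format.
features, label = parse_iris_row("5.1,3.5,1.4,0.2,Iris-setosa")
print(features, label)
```

Encoding the class as an integer is what lets the whole file be loaded as a single numeric array.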

(2) Experiment code

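import numpy as np
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt

def iris_type(s):
    # np.loadtxt may pass the field as bytes, depending on the NumPy version
    if isinstance(s, bytes):
        s = s.decode()
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]

if __name__ == "__main__":
    path = 'iris.data'  # path to the data file
    # path, float data, comma-separated; column 4 is converted by iris_type
    data = np.loadtxt(path, dtype=float, delimiter=',', converters={4: iris_type})
    # columns 0-3 form x, column 4 becomes y
    x, y = np.split(data, (4,), axis=1)
    # use only the first two features so the result can be plotted
    x = x[:, :2]
    # logistic regression model
    logreg = LogisticRegression()
    # fit the regression parameters to the data [x, y]
    logreg.fit(x, y.ravel())
    # plotting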
    # number of sample points along each axis
    N, M = 500, 500
    # range of column 0
    x1_min, x1_max = x[:, 0].min(), x[:, 0].max()
    # range of column 1
    x2_min, x2_max = x[:, 1].min(), x[:, 1].max()
    t1 = np.linspace(x1_min, x1_max, N)
    t2 = np.linspace(x2_min, x2_max, M)
    # build the grid of sample points
    x1, x2 = np.meshgrid(t1, t2)
    # test points, one (x1, x2) pair per row
    x_test = np.stack((x1.ravel(), x2.ravel()), axis=1)
    # predicted class for every grid point
    y_hat = logreg.predict(x_test)
    # reshape the predictions to match the grid
    y_hat = y_hat.reshape(x1.shape)
    # show the predicted regions
    plt.pcolormesh(x1, x2, y_hat, cmap=plt.cm.prism)
    # show the samples
    plt.scatter(x[:, 0], x[:, 1], c=np.squeeze(y), edgecolors='k', cmap=plt.cm.prism)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(x1_min, x1_max)
    plt.ylim(x2_min, x2_max)
    plt.grid()
    plt.show()
    # predictions on the training set
    y_hat = logreg.predict(x)
    y = y.reshape(-1)
    print(y_hat.shape)
    print(y.shape)
    result = y_hat == y
    print(y_hat)
    print(y)
    print(result)
    c = np.count_nonzero(result)
    print(c)
    print('Accuracy: %.2f%%' % (100 * float(c) / float(len(result))))

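2.2.3 Analysis of Results

(1) Experimental results

(2) Analysis

1. Using only two features, sepal length and sepal width, 115 of the 150 samples are classified correctly, an accuracy of 76.67%.

2. When we use more features (all 4 features) and run the program again: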

import numpy as np
from sklearn.linear_model import LogisticRegression

def iris_type(s):
    # np.loadtxt may pass the field as bytes, depending on the NumPy version
    if isinstance(s, bytes):
        s = s.decode()
    it = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
    return it[s]

if __name__ == "__main__":
    path = 'iris.data'  # path to the data file
    # path, float data, comma-separated; column 4 is converted by iris_type
    data = np.loadtxt(path, dtype=float, delimiter=',', converters={4: iris_type})
    # columns 0-3 form x, column 4 becomes y
    x, y = np.split(data, (4,), axis=1)
    # use all four features this time (no x = x[:, :2])
    # logistic regression model
    logreg = LogisticRegression()
    # fit the regression parameters to the data [x, y]
    logreg.fit(x, y.ravel())
    # no decision-region grid here: four features cannot be drawn as a 2-D
    # plot, and a 100**4-point grid would need several gigabytes of memory
    # predictions on the training set
    y_hat = logreg.predict(x)
    y = y.reshape(-1)
    print(y_hat.shape)
    print(y.shape)
    result = y_hat == y
    print(y_hat)
    print(y)
    print(result)
    c = np.count_nonzero(result)
    print(c)
    print('Accuracy: %.2f%%' % (100 * float(c) / float(len(result))))

With all four features, 144 of the 150 samples are classified correctly, an accuracy of 96.00%.
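
The two reported accuracies follow directly from the counts of correctly classified samples. A minimal sketch of that arithmetic, using the counts stated in this section (`accuracy_percent` is an illustrative helper name):

```python
def accuracy_percent(n_correct, n_total):
    """Share of correctly classified samples, as a percentage."""
    return 100.0 * n_correct / n_total

# Counts reported in this section: two features, then all four features.
print("%.2f%%" % accuracy_percent(115, 150))
print("%.2f%%" % accuracy_percent(144, 150))
```

Note that both figures are measured on the same 150 samples used for training, so they describe how well the model fits the training data rather than how it would perform on unseen flowers.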
