Learning the sklearn Library: Linear Models
Linear models make predictions using a linear function of the input features. Algorithms for learning linear models differ in two ways:
(1) How they measure how well a particular combination of coefficients and intercept fits the training data. Different algorithms measure "fit to the training set" differently; this measure is called the loss function.
(2) Whether they use regularization, and if so, which kind.
The main parameter of linear models is the regularization parameter. If you assume that only a few features are actually important, you should use L1 regularization; otherwise L2 regularization is the sensible default.
When working with large datasets, it is worth investigating the solver='sag' option of LogisticRegression and Ridge, which can be faster than the default.
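A minimal sketch of that option (the large synthetic dataset from make_regression is an illustrative assumption, not part of the original notes):
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
#generate a large synthetic regression problem just to have something big to fit
X,y = make_regression(n_samples = 100000, n_features = 50, random_state = 0)
#solver = 'sag' (Stochastic Average Gradient) can be faster than the default solver on large data
ridge_sag = Ridge(solver = 'sag').fit(X,y)
print('Training set score:{:.2f}'.format(ridge_sag.score(X,y)))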
Linear models for regression
The prediction formula for a linear regression model is
ŷ = w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b
where x[0] to x[p] are the features of a single data point, each w[i] is the slope along the corresponding feature axis (the weight of that input feature), w and b are the parameters learned by the model, and ŷ is the model's prediction.
Learning the parameters w[0] and b on the one-dimensional wave dataset:
import mglearn
mglearn.plots.plot_linear_regression_wave()
Linear regression (ordinary least squares)
Linear regression finds the parameters w and b that minimize the mean squared error between the predictions on the training set and the true regression targets y.
Mean squared error: the sum of squared differences between the predictions and the true values, divided by the number of samples.
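To make that definition concrete, here is a quick check (a sketch; the toy arrays are made up, and sklearn's mean_squared_error implements the same formula):
from sklearn.metrics import mean_squared_error
import numpy as np
y_true = np.array([1.0,2.0,3.0])
y_pred = np.array([1.5,2.0,2.0])
#by hand: ((0.5)**2 + 0**2 + 1**2) / 3 = 0.4166...
print(mean_squared_error(y_true,y_pred))
print(np.mean((y_true - y_pred) ** 2)) #same value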
#Linear regression's predictions on the wave dataset
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn
X,y = mglearn.datasets.make_wave(n_samples = 60)
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 42)
lr = LinearRegression().fit(X_train,y_train)
#sklearn always stores values derived from the training data in attributes ending with a trailing underscore, to separate them from parameters set by the user
print('lr.coef_:{}'.format(lr.coef_))
print('lr.intercept_:{}'.format(lr.intercept_))
#If the scores on the training and test sets are very close, the model is probably underfitting
print('Training set score:{:.2f}'.format(lr.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(lr.score(X_test,y_test)))
#LinearRegression on a high-dimensional dataset: the extended Boston Housing dataset
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn
X,y = mglearn.datasets.load_extended_boston()
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)
lr = LinearRegression().fit(X_train,y_train)
#A gap between training and test set performance is a clear sign of overfitting
print('Training set score:{:.2f}'.format(lr.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(lr.score(X_test,y_test)))
Ridge regression
Ridge regression uses the same prediction formula as ordinary least squares, but it applies an L2 regularization constraint so that each feature's influence on the output is as small as possible. A larger alpha means a more strongly constrained model, so we expect the entries of coef_ for a large alpha to be smaller in magnitude than those for a small alpha.
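For reference, sklearn's Ridge documentation states the objective being minimized as
$$\min_w \; \lVert Xw - y \rVert_2^2 + \alpha \lVert w \rVert_2^2$$
where the second term is the L2 penalty scaled by alpha.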
#Ridge on a high-dimensional dataset: the extended Boston Housing dataset
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import mglearn
X,y = mglearn.datasets.load_extended_boston()
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)
ridge = Ridge().fit(X_train,y_train)
#Ridge scores lower than LinearRegression on the training set but higher on the test set
#Linear regression overfits this data; Ridge is a more constrained model and less prone to overfitting
print('Training set score:{:.2f}'.format(ridge.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(ridge.score(X_test,y_test)))
#Tuning alpha: increasing alpha forces the coefficients toward 0, which lowers training set performance but may (!) improve generalization
#Ridge on a high-dimensional dataset: the extended Boston Housing dataset
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import mglearn
X,y = mglearn.datasets.load_extended_boston()
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)
#default alpha = 1.0
ridge = Ridge().fit(X_train,y_train)
print('Training set score:{:.2f}'.format(ridge.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(ridge.score(X_test,y_test)))
#alpha = 10
ridge10 = Ridge(alpha = 10).fit(X_train,y_train)
print('Training set score:{:.2f}'.format(ridge10.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(ridge10.score(X_test,y_test)))
#alpha = 0.1
ridge01 = Ridge(alpha = 0.1).fit(X_train,y_train)
print('Training set score:{:.2f}'.format(ridge01.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(ridge01.score(X_test,y_test)))
#Plot the coefficient values, one marker per coefficient
plt.plot(ridge.coef_,'s',label = "Ridge alpha = 1")
plt.plot(ridge10.coef_,'^',label = "Ridge alpha = 10")
plt.plot(ridge01.coef_,'v',label = "Ridge alpha = 0.1")
plt.xlabel("Coefficient index") #x軸對應coef_的元素,x=i對應第i個特徵的係數,y軸表示該係數的具體數值
plt.ylabel("Coefficient magnitude") #係數震級
plt.hlines(0,0,len(ridge.coef_)) #畫橫座標
plt.ylim(-25,25) #設定座標軸的最大最小區間
plt.legend(loc = 'best')
import mglearn
#Fix the alpha value and vary the amount of training data
#Subsample the Boston Housing data and evaluate LinearRegression and Ridge(alpha = 1) on subsets of increasing size
#Learning curves
mglearn.plots.plot_ridge_n_samples()
#The training performance of linear regression decreases as more data is added
#With enough training data, regularization becomes less important
Lasso
Lasso also constrains the coefficients so that they are close to 0, but it does so by a different mechanism: L1 regularization. The consequence of L1 regularization is that, with Lasso, some coefficients are exactly 0, which can be viewed as automatic feature selection.
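For reference, sklearn's Lasso documentation states the minimized objective as
$$\min_w \; \frac{1}{2\,n_{\text{samples}}} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$$
where the L1 penalty $\alpha \lVert w \rVert_1$ is what drives some coefficients to exactly 0.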
#Apply Lasso to the extended Boston Housing dataset
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
import mglearn
import numpy as np
import matplotlib.pyplot as plt
X,y = mglearn.datasets.load_extended_boston()
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)
lasso = Lasso().fit(X_train,y_train)
#Poor performance on both the training and the test set indicates underfitting
print('Training set score:{:.2f}'.format(lasso.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(lasso.score(X_test,y_test)))
print('Number of features used: {}'.format(np.sum(lasso.coef_ != 0))) #show the number of features with nonzero coefficients
#Lasso also has a regularization parameter alpha (default 1.0) that controls how strongly the coefficients are pushed toward 0. To reduce underfitting, decrease alpha and increase max_iter (the maximum number of iterations to run)
#This fits a more complex model
lasso001 = Lasso(alpha = 0.01, max_iter = 100000).fit(X_train,y_train)
print('Training set score:{:.2f}'.format(lasso001.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(lasso001.score(X_test,y_test)))
print('Number of features used: {}'.format(np.sum(lasso001.coef_ != 0)))
#But if alpha is set too low, the effect of regularization is removed and the model overfits
lasso00001 = Lasso(alpha = 0.0001, max_iter = 100000).fit(X_train,y_train)
print('Training set score:{:.2f}'.format(lasso00001.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(lasso00001.score(X_test,y_test)))
print('Number of features used: {}'.format(np.sum(lasso00001.coef_ != 0)))
plt.plot(lasso.coef_,'s',label = 'Lasso alpha = 1')
plt.plot(lasso001.coef_,'^',label = 'Lasso alpha = 0.01')
plt.plot(lasso00001.coef_,'v',label = 'Lasso alpha = 0.0001')
plt.xlabel('Coefficient index')
plt.ylabel('Coefficient magnitude')
plt.legend(ncol = 2,loc = (0,1.05)) #a legend with 2 columns
plt.ylim(-25,25)
sklearn also provides the ElasticNet class, which combines the penalties of Lasso and Ridge; it has two parameters to adjust, controlling the L1 and the L2 regularization.
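A minimal sketch of ElasticNet on the same extended Boston data (the alpha and l1_ratio values are arbitrary illustrative choices; in sklearn the L1/L2 mix is controlled by the single l1_ratio parameter rather than two separate weights):
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
import mglearn
X,y = mglearn.datasets.load_extended_boston()
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state = 0)
#l1_ratio = 1 is pure L1 (like Lasso), l1_ratio = 0 is pure L2 (like Ridge)
enet = ElasticNet(alpha = 0.01, l1_ratio = 0.5, max_iter = 100000).fit(X_train,y_train)
print('Training set score:{:.2f}'.format(enet.score(X_train,y_train)))
print('Test set score:{:.2f}'.format(enet.score(X_test,y_test)))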
Linear models for classification
Instead of returning the weighted sum of the features directly, the model thresholds the predicted value at 0: if ŷ < 0, class -1 is predicted; if ŷ > 0, class +1. For linear classification models the decision boundary is a linear function of the input, i.e. a linear classifier separates two classes using a line, a plane, or a hyperplane.
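A quick sanity check of that thresholding rule (a sketch; note that sklearn labels the two forge classes 0 and 1 rather than -1 and +1):
from sklearn.linear_model import LogisticRegression
import mglearn
import numpy as np
X,y = mglearn.datasets.make_forge()
clf = LogisticRegression().fit(X,y)
#decision_function returns the raw value of the linear function for each point
scores = clf.decision_function(X)
#the predicted class is 1 exactly where the decision function is positive
print(np.all((scores > 0).astype(int) == clf.predict(X)))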
#Apply two linear classification models to the forge dataset and visualize the decision boundaries
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn.svm import LinearSVC #linear support vector machine
import matplotlib.pyplot as plt #needed for plt.subplots below
import mglearn
X,y = mglearn.datasets.make_forge()
fig,axes = plt.subplots(1,2,figsize = (10,3))
for model,ax in zip([LinearSVC(), LogisticRegression()],axes):
    clf = model.fit(X,y)
    #the alpha argument only controls how transparently the boundary line is drawn
    mglearn.plots.plot_2d_separator(clf, X, fill = False, eps = 0.5, ax = ax, alpha = 0.7) #visualize the decision boundary
    mglearn.discrete_scatter(X[:,0],X[:,1],y,ax = ax) #plot the data points
    ax.set_title("{}".format(clf.__class__.__name__))
    ax.set_xlabel("Feature 0")
    ax.set_ylabel("Feature 1")
    ax.legend(loc = "best")
LogisticRegression and LinearSVC use L2 regularization by default. The trade-off parameter that determines the strength of the regularization is called C; a larger C corresponds to weaker regularization.
#Decision boundaries of a linear SVM on the forge dataset for different values of C
import mglearn
mglearn.plots.plot_linear_svc_regularization()
Linear models for classification are very powerful in high-dimensional spaces. With a growing number of features, avoiding overfitting becomes increasingly important.
#A detailed look at LogisticRegression on the Breast Cancer dataset
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression #needed for the models fitted below
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
cancer = load_breast_cancer()
X_train,X_test,y_train,y_test = train_test_split(cancer.data, cancer.target,stratify = cancer.target,random_state = 42)
#C = 1.0
logreg = LogisticRegression().fit(X_train,y_train)
print("Training set score:{:.2f}".format(logreg.score(X_train,y_train)))
print("Test set score:{:.3f}".format(logreg.score(X_test,y_test)))
#C = 100
logreg100 = LogisticRegression(C = 100).fit(X_train,y_train)
print("Training set score:{:.2f}".format(logreg100.score(X_train,y_train)))
print("Test set score:{:.3f}".format(logreg100.score(X_test,y_test)))
#C = 0.01
logreg001 = LogisticRegression(C = 0.01).fit(X_train,y_train)
print("Training set score:{:.2f}".format(logreg001.score(X_train,y_train)))
print("Test set score:{:.3f}".format(logreg001.score(X_test,y_test)))
plt.plot(logreg.coef_.T,'o',label = "C = 1")
plt.plot(logreg100.coef_.T,'^',label = "C = 100")
plt.plot(logreg001.coef_.T,'v',label = "c = 0.01")
plt.xticks(range(cancer.data.shape[1]),cancer.feature_names,rotation = 90)
plt.hlines(0,0,cancer.data.shape[1])
plt.ylim(-5,5)
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.legend()
#The coefficients can tell us which feature is associated with which class
#LogisticRegression with L1 regularization
for C, marker in zip([0.001,1,100],['o','^','v']):
    #penalty = "l1" requires a solver that supports it, e.g. liblinear
    lr_l1 = LogisticRegression(C = C, penalty = "l1", solver = "liblinear").fit(X_train,y_train)
    print("Training accuracy of l1 logreg with C ={:.3f}:{:.2f}".format(C,lr_l1.score(X_train,y_train)))
    print("Test accuracy of l1 logreg with C ={:.3f}:{:.2f}".format(C,lr_l1.score(X_test,y_test)))
    plt.plot(lr_l1.coef_.T,marker,label = "C={:.3f}".format(C))
plt.xticks(range(cancer.data.shape[1]),cancer.feature_names,rotation = 90)
plt.hlines(0,0,cancer.data.shape[1])
plt.ylim(-5,5)
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.legend(loc = 3)
The model's penalty parameter therefore influences the regularization, i.e. whether the model uses all available features or selects only a subset of them.
Linear models for multiclass classification
Many linear classification models are restricted to binary classification and do not extend naturally to the multiclass case. A common technique for extending a binary classification algorithm to multiclass is the one-vs-rest approach.
In one-vs-rest, a separate binary model is learned for each class, trying to separate that class from all the other classes.
Each class is thus represented by one binary classifier, so each class has a coefficient vector w and an intercept b; the class whose classifier yields the highest score is the predicted class label.
#A two-dimensional toy dataset containing 3 classes
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC #needed for the classifier below
import mglearn
import matplotlib.pyplot as plt
import numpy as np #needed for np.linspace below
X,y = make_blobs(random_state = 42)
mglearn.discrete_scatter(X[:,0],X[:,1],y)
#Train a LinearSVC classifier
linear_svm = LinearSVC().fit(X,y)
print("Coefficient shape:", linear_svm.coef_.shape) #三條線,兩個特徵
print("Intercept shape:", linear_svm.intercept_.shape)
line = np.linspace(-15,15)
for coef,intercept,color in zip(linear_svm.coef_, linear_svm.intercept_,['b','r','g']):
    plt.plot(line, -(line * coef[0] + intercept) / coef[1], c = color)
plt.ylim(-10,15)
plt.xlim(-10,8)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
plt.legend(["Class 0","Class 1","Class 2","Line class 0","Line class 1","Line class 2"], loc = (1.01,0.3))
mglearn.plots.plot_2d_classification(linear_svm, X, fill = True, alpha = .7) #visualize the multiclass decision regions
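A small check of the "highest score wins" rule described above, reusing the linear_svm fitted in this snippet:
#for a multiclass LinearSVC, decision_function returns one score per class and sample
scores = linear_svm.decision_function(X)
#the predicted label is the class whose classifier gives the highest score
print(np.all(np.argmax(scores, axis = 1) == linear_svm.predict(X)))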
Questions about the code and methods
train_test_split(X, y, stratify=y)
https://blog.csdn.net/weixin_37226516/article/details/62042550
Ordinary least squares (OLS)
https://blog.csdn.net/enjoy524/article/details/53556038
Python notes: details of the plt.plot() function
https://blog.csdn.net/cjcrxzz/article/details/79627483
Setting axis ranges for Matplotlib plots in Python
https://blog.csdn.net/ccy950903/article/details/50688449
Matrix theory: vector norms and matrix norms
https://blog.csdn.net/pipisorry/article/details/51030563
Understanding regularization and regularization terms
https://blog.csdn.net/gshgsh1228/article/details/52199870
Deep learning: the L0, L1, and L2 norms
https://blog.csdn.net/zchang81/article/details/70208061
Machine learning: sklearn.Lasso
https://www.jianshu.com/p/1177a0bcb306