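Machine Learning: Regression 2-5 (LASSO Regression)
阿新 • Published: 2022-03-15
Predicting medical charges from multiple factors with LASSO regression
Main steps:
- 1. Import packages
- 2. Load the dataset
- 3. Preprocess the data
  - 3.1 Check for missing values
  - 3.2 Label encoding & one-hot encoding
  - 3.3 Get the independent and dependent variables
  - 3.4 Split into training and test sets
  - 3.5 Feature scaling
- 4. Build LASSO regression models with different parameters
  - 4.1 Model 1: Build a LASSO regression model (alpha = 0.1)
    - 4.1.1 Build the LASSO regression model
    - 4.1.2 Get the model expression
    - 4.1.3 Predict on the test set
    - 4.1.4 Get the model MSE
  - 4.2 Model 2: Build a LASSO regression model (alpha = 0.01)
  - 4.3 Model 3: Build a LASSO regression model (alpha = 1e-5)
  - 4.4 Model 4: Build a LASSO regression model (alpha = 1e-9)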
1. Import packages
In [1]:# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
2. Load the dataset
In [2]:# Load the dataset
data = pd.read_csv('insurance.csv')
data.head()
Out[2]:
 | age | sex | bmi | children | smoker | region | charges |
---|---|---|---|---|---|---|---|
0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
3. Preprocess the data
3.1 Check for missing values
In [3]:# Check for missing values
null_df = data.isnull().sum()
null_df
Out[3]:
age 0
sex 0
bmi 0
children 0
smoker 0
region 0
charges 0
dtype: int64
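3.2 Label encoding & one-hot encoding
In [4]:# Label encoding & one-hot encoding
# dtype = int keeps the dummy columns as 0/1; recent pandas versions return booleans by default
data = pd.get_dummies(data, drop_first = True, dtype = int)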
data.head()
Out[4]:
 | age | bmi | children | charges | sex_male | smoker_yes | region_northwest | region_southeast | region_southwest |
---|---|---|---|---|---|---|---|---|---|
0 | 19 | 27.900 | 0 | 16884.92400 | 0 | 1 | 0 | 0 | 1 |
1 | 18 | 33.770 | 1 | 1725.55230 | 1 | 0 | 0 | 1 | 0 |
2 | 28 | 33.000 | 3 | 4449.46200 | 1 | 0 | 0 | 1 | 0 |
3 | 33 | 22.705 | 0 | 21984.47061 | 1 | 0 | 1 | 0 | 0 |
4 | 32 | 28.880 | 0 | 3866.85520 | 1 | 0 | 1 | 0 | 0 |
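To make the effect of drop_first concrete, here is a small sketch on a made-up frame (the rows are illustrative, not taken from insurance.csv). It assumes pandas is available and shows which category of each column becomes the all-zeros baseline.

```python
import pandas as pd

# Toy frame with made-up rows, only for illustration
toy = pd.DataFrame({
    'smoker': ['yes', 'no', 'no', 'yes'],
    'region': ['southwest', 'southeast', 'northwest', 'northeast'],
})

# drop_first=True drops the first (alphabetically smallest) category of each
# column, so that category is represented by zeros in all remaining dummies
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Here 'no' and 'northeast' are the dropped baselines, which is why the encoded insurance data has no smoker_no or region_northeast column. Dropping one level per feature avoids perfectly collinear dummy columns.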
3.3 Get the independent and dependent variables
In [5]:# Get the independent and dependent variables
y = data['charges'].values
data = data.drop(['charges'], axis = 1)
x = data.values
3.4 Split into training and test sets
In [6]:# Split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(1070, 8)
(268, 8)
(1070,)
(268,)
3.5 Feature scaling
In [7]:# Feature scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = np.ravel(sc_y.fit_transform(y_train.reshape(-1, 1)))
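The scaling step above standardizes the target as well as the features, so predictions later have to be mapped back to the original units. The sketch below, on made-up numbers, shows what StandardScaler computes and that inverse_transform undoes it. The names y_demo and scaler are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up target values, only for illustration
y_demo = np.array([1000.0, 2000.0, 4000.0, 9000.0])

scaler = StandardScaler()
# StandardScaler expects a 2D array, hence reshape(-1, 1); ravel() flattens the result
y_scaled = scaler.fit_transform(y_demo.reshape(-1, 1)).ravel()

# The same result by hand: subtract the mean, divide by the population standard deviation
y_manual = (y_demo - y_demo.mean()) / y_demo.std()

# inverse_transform also takes a 2D array and restores the original units
y_restored = scaler.inverse_transform(y_scaled.reshape(-1, 1)).ravel()

print(np.allclose(y_scaled, y_manual))
print(np.allclose(y_restored, y_demo))
```

Note that the notebook fits the scalers on the training set only and reuses the training statistics for the test features, which keeps test information out of the preprocessing.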
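4. Build LASSO regression models with different parameters
4.1 Model 1: Build a LASSO regression model (alpha = 0.1)
4.1.1 Build the LASSO regression model
In [8]:# Build LASSO regression models with different parameters
# Model 1: LASSO regression model (alpha = 0.1)
from sklearn.linear_model import Lasso
# The normalize argument was removed in scikit-learn 1.2; leaving it out matches the old normalize = False
regressor = Lasso(alpha = 0.1, fit_intercept = True)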
regressor.fit(x_train, y_train)
Out[8]:
Lasso(alpha=0.1)
4.1.2 Get the model expression
In [9]:# Get the model expression
print('The model expression is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.5f + ' %(columns[i], coefs[i]), end='')
print(regressor.intercept_)
The model expression is:
Charges = age * 0.20808 + bmi * 0.06406 + children * 0.00000 + sex_male * 0.00000 + smoker_yes * 0.69192 + region_northwest * -0.00000 + region_southeast * 0.00000 + region_southwest * -0.00000 + -3.523728190184538e-16
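As the expression shows, the coefficients of children, sex_male, and the three region features are shrunk to exactly 0, while age, bmi, and smoker_yes keep nonzero coefficients. LASSO therefore acts as feature selection here, which achieves the goal of dimensionality reduction. Because both the features and the target were standardized, the coefficients are in standardized units and the intercept is effectively 0.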
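One rough way to see why LASSO produces coefficients that are exactly 0 is the soft-thresholding rule. In the idealized case where the standardized features are uncorrelated (an assumption that does not hold exactly for the insurance data), minimizing the objective used by scikit-learn's Lasso reduces to shrinking each unpenalized coefficient toward 0 by alpha and clipping anything smaller than alpha to 0. The helper soft_threshold and the input values below are hypothetical and only illustrate the rule.

```python
import numpy as np

def soft_threshold(z, alpha):
    """Shrink each value toward 0 by alpha; values with |z| <= alpha become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

# Hypothetical unpenalized coefficients, chosen only for illustration
unpenalized = np.array([0.30, 0.16, 0.04, -0.01])

shrunk = soft_threshold(unpenalized, alpha=0.1)
print(shrunk)
print('nonzero coefficients:', np.count_nonzero(shrunk))
```

Large coefficients survive with a reduced magnitude, and small ones are removed entirely, which matches the qualitative pattern in the expression for alpha = 0.1.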
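4.1.3 Predict on the test set
In [10]:# Predict on the test set
y_pred = regressor.predict(x_test)
# Convert y_pred back to the original scale; inverse_transform expects a 2D array
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1)).ravel()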
4.1.4 Get the model MSE
In [11]:# Get the model MSE
from sklearn.metrics import mean_squared_error
mse_score = mean_squared_error(y_test, y_pred)
print('MSE of the LASSO regression model with alpha=0.1:', format(mse_score, ','))
MSE of the LASSO regression model with alpha=0.1: 42,343,876.719546765
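An MSE in the tens of millions is hard to read because it is in squared charge units. As a small aid, the sketch below takes the square root of the value printed above to get the RMSE, which is in the same units as charges. The variable names are illustrative.

```python
import math

# MSE printed above for alpha = 0.1, in squared charge units
mse_alpha_01 = 42343876.719546765

# RMSE is the square root of the MSE, in the same units as charges
rmse_alpha_01 = math.sqrt(mse_alpha_01)
print(round(rmse_alpha_01))  # 6507
```

So a typical prediction error for this model is on the order of 6,500 in charge units.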
4.2 Model 2: Build a LASSO regression model (alpha = 0.01)
In [12]:# Model 2: LASSO regression model (alpha = 0.01)
regressor = Lasso(alpha = 0.01, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[12]:
Lasso(alpha=0.01)
In [13]:
# Get the linear expression
print('The model expression is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' %(columns[i], coefs[i]), end='')
print(regressor.intercept_)
The model expression is:
Charges = age * 0.29 + bmi * 0.15 + children * 0.03 + sex_male * -0.00 + smoker_yes * 0.78 + region_northwest * 0.00 + region_southeast * -0.01 + region_southwest * -0.01 + -7.632127816830208e-16
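In [14]:
# Predict on the test set
y_pred = regressor.predict(x_test)
# Convert y_pred back to the original scale; inverse_transform expects a 2D array
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1)).ravel()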
In [15]:
# Get the model MSE
mse_score = mean_squared_error(y_test, y_pred)
print('MSE of the LASSO regression model with alpha=0.01:', format(mse_score, ','))
MSE of the LASSO regression model with alpha=0.01: 35,879,738.58883889
4.3 Model 3: Build a LASSO regression model (alpha = 1e-5)
In [16]:# Model 3: LASSO regression model (alpha = 1e-5)
regressor = Lasso(alpha = 1e-5, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[16]:
Lasso(alpha=1e-05)
In [17]:
# Get the linear expression
print('The model expression is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' %(columns[i], coefs[i]), end='')
print(regressor.intercept_)
The model expression is:
Charges = age * 0.30 + bmi * 0.16 + children * 0.04 + sex_male * -0.01 + smoker_yes * 0.80 + region_northwest * -0.01 + region_southeast * -0.04 + region_southwest * -0.03 + -8.359515966965514e-16
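In [18]:
# Predict on the test set
y_pred = regressor.predict(x_test)
# Convert y_pred back to the original scale; inverse_transform expects a 2D array
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1)).ravel()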
In [19]:
# Get the model MSE
mse_score = mean_squared_error(y_test, y_pred)
print('MSE of the LASSO regression model with alpha=1e-5:', format(mse_score, ','))
MSE of the LASSO regression model with alpha=1e-5: 35,479,553.42378739
4.4 Model 4: Build a LASSO regression model (alpha = 1e-9)
In [20]:# Model 4: LASSO regression model (alpha = 1e-9)
regressor = Lasso(alpha = 1e-9, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[20]:
Lasso(alpha=1e-09)
In [21]:
# Get the linear expression
print('The model expression is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' %(columns[i], coefs[i]), end='')
print(regressor.intercept_)
The model expression is:
Charges = age * 0.30 + bmi * 0.16 + children * 0.04 + sex_male * -0.01 + smoker_yes * 0.80 + region_northwest * -0.01 + region_southeast * -0.04 + region_southwest * -0.03 + -8.360255596886574e-16
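In [22]:
# Predict on the test set
y_pred = regressor.predict(x_test)
# Convert y_pred back to the original scale; inverse_transform expects a 2D array
y_pred = sc_y.inverse_transform(y_pred.reshape(-1, 1)).ravel()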
In [23]:
# Get the model MSE
mse_score = mean_squared_error(y_test, y_pred)
print('MSE of the LASSO regression model with alpha=1e-9:', format(mse_score, ','))
MSE of the LASSO regression model with alpha=1e-9: 35,479,352.82734644
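Conclusion: The four models show that the hyperparameter alpha has a clear effect on the performance of the LASSO regression model. The test MSE falls from about 42.3 million at alpha = 0.1 to about 35.9 million at alpha = 0.01 and about 35.5 million at alpha = 1e-5, then barely changes at alpha = 1e-9. On this dataset weaker regularization fits better, and below alpha = 1e-5 the penalty is too small to make a noticeable difference.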
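The four models repeat the same fit, predict, and score steps, so the comparison can also be written as a single loop. The sketch below does this on synthetic data from make_regression, used here as a stand-in because insurance.csv is not bundled with this post. The dataset, the results dictionary, and the variable names are illustrative assumptions, and the numbers it prints will not match the insurance results.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 8 features, only 3 of them informative
x, y = make_regression(n_samples=500, n_features=8, n_informative=3,
                       noise=10.0, random_state=1)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=1)

# Same preprocessing as in the post: scalers are fit on the training set only
sc_x = StandardScaler()
x_train = sc_x.fit_transform(x_train)
x_test = sc_x.transform(x_test)
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.reshape(-1, 1)).ravel()

results = {}
for alpha in [0.1, 0.01, 1e-5, 1e-9]:
    model = Lasso(alpha=alpha)
    model.fit(x_train, y_train)
    # Undo the target scaling so the MSE is in the original units
    y_pred = sc_y.inverse_transform(model.predict(x_test).reshape(-1, 1)).ravel()
    results[alpha] = {
        'mse': mean_squared_error(y_test, y_pred),
        'nonzero': int(np.count_nonzero(model.coef_)),
    }

for alpha, r in results.items():
    print(f"alpha={alpha:g}: MSE={r['mse']:,.2f}, nonzero coefficients={r['nonzero']}")
```

Tracking the number of nonzero coefficients next to the MSE makes the trade-off visible: a larger alpha gives a sparser model, and a smaller alpha approaches an ordinary least-squares fit. For picking alpha systematically rather than by hand, cross-validation (for example scikit-learn's LassoCV) is a common choice.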