Machine Learning: Regression 2-4 (Ridge Regression)

Using ridge regression to predict medical charges from multiple factors

 

Dataset link: https://www.cnblogs.com/ojbtospark/p/16005626.html

 

Main workflow steps:

1. Import packages

In [1]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

 

2. Load the dataset

In [2]:
# Load the dataset
data = pd.read_csv('insurance.csv')
data.head()
Out[2]:
   age     sex     bmi  children  smoker     region      charges
0   19  female  27.900         0     yes  southwest  16884.92400
1   18    male  33.770         1      no  southeast   1725.55230
2   28    male  33.000         3      no  southeast   4449.46200
3   33    male  22.705         0      no  northwest  21984.47061
4   32    male  28.880         0      no  northwest   3866.85520
 

3. Data preprocessing

3.1 Check for missing values

In [3]:
# Count missing values in each column
null_df = data.isnull().sum()
null_df
Out[3]:
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

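3.2 Label encoding and one-hot encoding

In [4]:
# One-hot encode the categorical columns; drop_first = True drops the first level of each
# dtype = int keeps the dummies as 0/1 (pandas 2.0+ would otherwise return booleans)
data = pd.get_dummies(data, drop_first = True, dtype = int)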
data.head()
Out[4]:
   age     bmi  children      charges  sex_male  smoker_yes  region_northwest  region_southeast  region_southwest
0   19  27.900         0  16884.92400         0           1                 0                 0                 1
1   18  33.770         1   1725.55230         1           0                 0                 1                 0
2   28  33.000         3   4449.46200         1           0                 0                 1                 0
3   33  22.705         0  21984.47061         1           0                 1                 0                 0
4   32  28.880         0   3866.85520         1           0                 1                 0                 0
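
To see what `drop_first = True` does, here is a minimal sketch on a made-up three-row frame (the values are illustrative and not taken from `insurance.csv`). `dtype=int` is passed explicitly so that the dummy columns hold 0/1 regardless of the pandas version.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns (made-up values)
toy = pd.DataFrame({
    'age': [19, 18, 28],
    'sex': ['female', 'male', 'male'],
    'region': ['southwest', 'southeast', 'northwest'],
})

# Full one-hot encoding: one column per category level
full = pd.get_dummies(toy, dtype=int)

# drop_first=True drops the first level (in sorted order) of each categorical
# column, here sex_female and region_northwest; that level becomes the
# all-zeros baseline
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level per column avoids a redundant dummy: once `sex_male` is known, `sex_female` carries no extra information.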

3.3 Separate the features and the target

In [5]:
# Separate the features (x) and the target (y)
y = data['charges'].values
data = data.drop(['charges'], axis = 1)
x = data.values

3.4 Split into training and test sets

In [6]:
# Split into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(1070, 8)
(268, 8)
(1070,)
(268,)
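
The shapes above follow from the row count: the two sets add up to 1,338 rows, and with `test_size = 0.2` the test share of 267.6 is rounded up to 268, leaving 1,070 rows for training. A quick check on placeholder arrays of the same shape (the zeros only stand in for the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the insurance features and target
x_demo = np.zeros((1338, 8))
y_demo = np.zeros(1338)

x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=1)
print(x_tr.shape, x_te.shape)
```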
 

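4. Build ridge regression models with different parameters

4.1 Model 1: ridge regression (alpha = 20)

4.1.1 Build the ridge regression model

In [7]:
# Build ridge regression models with different parameters
# Model 1: ridge regression (alpha = 20)
# Note: `normalize` was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell runs as written only on older versions
from sklearn.linear_model import Ridge
regressor = Ridge(alpha = 20, normalize = True, fit_intercept = True)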
regressor.fit(x_train, y_train)
Out[7]:
Ridge(alpha=20, normalize=True)
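
The `normalize` argument is no longer available in recent scikit-learn releases (deprecated in 1.0, removed in 1.2). One possible substitute is a `StandardScaler` + `Ridge` pipeline. The sketch below is an approximation under stated assumptions, not the original author's code: the old option divided each centered column by its L2 norm, whereas `StandardScaler` divides by the standard deviation, so `alpha` is multiplied by the number of training samples to compensate. The synthetic data and the helper name `make_ridge` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def make_ridge(alpha, n_samples):
    """Hypothetical helper: approximate Ridge(alpha, normalize=True) on scikit-learn >= 1.2."""
    # Assumption: scaling alpha by n_samples compensates for StandardScaler
    # dividing by the standard deviation instead of the column L2 norm
    return make_pipeline(StandardScaler(), Ridge(alpha=alpha * n_samples))

# Synthetic stand-in for x_train / y_train (not the insurance data)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 3)) * [1.0, 10.0, 100.0]
y_demo = x_demo @ np.array([3.0, 0.5, 0.02]) + rng.normal(size=200)

model = make_ridge(alpha=0.01, n_samples=len(x_demo))
model.fit(x_demo, y_demo)

# The Ridge step stores coefficients on the standardized feature scale;
# dividing by the scaler's per-column scale maps them back to original units
ridge = model.named_steps['ridge']
scaler = model.named_steps['standardscaler']
coef_original = ridge.coef_ / scaler.scale_
print(coef_original)
```

Calling `model.predict(x_test)` on the pipeline applies the training-set scaling automatically, so the rest of the workflow stays the same.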

4.1.2 Get the fitted equation

In [8]:
# Print the fitted equation
print('The fitted equation is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' %(columns[i], coefs[i]), end='')
print(regressor.intercept_)
The fitted equation is:
 Charges = age * 12.48 + bmi * 17.21 + children * 14.86 + sex_male * 60.23 + smoker_yes * 1121.22 + region_northwest * -34.52 + region_southeast * 61.62 + region_southwest * -33.53 + 11938.446490743021
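
The printed equation reads awkwardly where a coefficient is negative (`* -34.52 +`). A small formatting helper can tidy this up; `format_equation` is a hypothetical name, and the sketch assumes only that feature names and coefficients arrive as equal-length sequences, as `data.columns` and `regressor.coef_` do here.

```python
def format_equation(names, coefs, intercept, target='Charges'):
    """Build a readable linear equation string with explicit signs (illustrative helper)."""
    # Start with the intercept, then append one signed term per feature
    parts = ['%.2f' % intercept]
    for name, coef in zip(names, coefs):
        sign = '-' if coef < 0 else '+'
        parts.append('%s %.2f * %s' % (sign, abs(coef), name))
    return '%s = %s' % (target, ' '.join(parts))

# Example with two of the coefficients printed above (intercept rounded)
equation = format_equation(['age', 'region_northwest'], [12.48, -34.52], 11938.45)
print(equation)
# prints: Charges = 11938.45 + 12.48 * age - 34.52 * region_northwest
```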

4.1.3 Predict on the test set

In [9]:
# Predict on the test set
y_pred = regressor.predict(x_test)

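4.1.4 Compute the model's MSE

In [10]:
# Compute the model's MSE on the test set
from sklearn.metrics import mean_squared_error
mse_score = mean_squared_error(y_test, y_pred)
print('With alpha=20, the MSE of the ridge regression model is:' , format(mse_score, ','))
With alpha=20, the MSE of the ridge regression model is: 138,769,173.1285671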
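
For reference, MSE is the mean of the squared differences between actual and predicted values, so its units are squared charges; taking the square root (RMSE) brings it back to the scale of `charges`. A minimal sketch with made-up numbers, checking the manual formula against `mean_squared_error`:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up actual and predicted values (not from the insurance data)
y_true_demo = np.array([100.0, 200.0, 300.0])
y_pred_demo = np.array([110.0, 190.0, 330.0])

# MSE = mean((y - y_hat)^2)
mse_manual = np.mean((y_true_demo - y_pred_demo) ** 2)
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

# RMSE is on the same scale as the target
rmse = np.sqrt(mse_manual)

print(mse_manual, mse_sklearn, rmse)
```

Applied to the result above, an MSE of about 138.8 million corresponds to an RMSE of roughly 11,780 in the units of `charges`.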

4.2 Model 2: ridge regression (alpha = 0.1)

In [11]:
# Model 2: ridge regression (alpha = 0.1)
regressor = Ridge(alpha = 0.1, normalize = True, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[11]:
Ridge(alpha=0.1, normalize=True)
In [12]:
# Print the fitted linear equation
print('The fitted equation is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' %(columns[i], coefs[i]), end='')
print(regressor.intercept_)
The fitted equation is:
 Charges = age * 234.53 + bmi * 291.63 + children * 361.72 + sex_male * -88.02 + smoker_yes * 21586.00 + region_northwest * -266.87 + region_southeast * -672.40 + region_southwest * -691.71 + -9237.600606458109
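In [13]:
# Predict on the test set
y_pred = regressor.predict(x_test)
In [14]:
# Compute the model's MSE on the test set
mse_score = mean_squared_error(y_test, y_pred)
print('With alpha=0.1, the MSE of the ridge regression model is:' , format(mse_score, ','))
With alpha=0.1, the MSE of the ridge regression model is: 36,841,099.26516503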

4.3 Model 3: ridge regression (alpha = 0.01)

In [15]:
# Model 3: ridge regression (alpha = 0.01)
regressor = Ridge(alpha = 0.01, normalize = True, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[15]:
Ridge(alpha=0.01, normalize=True)
In [16]:
# Print the fitted linear equation
print('The fitted equation is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' %(columns[i], coefs[i]), end='')
print(regressor.intercept_)
The fitted equation is:
 Charges = age * 255.00 + bmi * 318.27 + children * 402.86 + sex_male * -223.99 + smoker_yes * 23546.28 + region_northwest * -377.66 + region_southeast * -992.59 + region_southwest * -875.29 + -11075.028462288014
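In [17]:
# Predict on the test set
y_pred = regressor.predict(x_test)
In [18]:
# Compute the model's MSE on the test set
mse_score = mean_squared_error(y_test, y_pred)
print('With alpha=0.01, the MSE of the ridge regression model is:' , format(mse_score, ','))
With alpha=0.01, the MSE of the ridge regression model is: 35,539,055.332710184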

4.4 Model 4: ridge regression (alpha = 0.0001)

In [19]:
# Model 4: ridge regression (alpha = 0.0001)
regressor = Ridge(alpha = 0.0001, normalize = True, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[19]:
Ridge(alpha=0.0001, normalize=True)
In [20]:
# Print the fitted linear equation
print('The fitted equation is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
for i in range(len(columns)):
    print('%s * %.2f + ' %(columns[i], coefs[i]), end='')
print(regressor.intercept_)
The fitted equation is:
 Charges = age * 257.47 + bmi * 321.59 + children * 408.01 + sex_male * -241.97 + smoker_yes * 23784.06 + region_northwest * -395.90 + region_southeast * -1037.90 + region_southwest * -902.75 + -11295.364555495733
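In [21]:
# Predict on the test set
y_pred = regressor.predict(x_test)
In [22]:
# Compute the model's MSE on the test set
mse_score = mean_squared_error(y_test, y_pred)
print('With alpha=0.0001, the MSE of the ridge regression model is:' , format(mse_score, ','))
With alpha=0.0001, the MSE of the ridge regression model is: 35,479,846.30114783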
 

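Conclusion: Across the four models, the test MSE falls from about 138.8 million at alpha = 20 to about 36.8 million at alpha = 0.1, 35.54 million at alpha = 0.01, and 35.48 million at alpha = 0.0001. On this dataset, heavy regularization (alpha = 20) shrinks the coefficients strongly toward zero (smoker_yes drops from about 23,784 to about 1,121) and the model underfits, while reducing alpha below 0.01 brings only marginal gains. The choice of the alpha hyperparameter therefore has a large effect on the performance of a ridge regression model.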
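
The four models repeat the same fit, predict, and score steps with a different `alpha`, which can be folded into a loop. The sketch below does this on synthetic data so that it runs without `insurance.csv`; the data, the helper name `sweep_alphas`, and the use of plain `Ridge` without the `normalize` option are assumptions for illustration, so its numbers will not match those in this post.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def sweep_alphas(x_train, x_test, y_train, y_test, alphas):
    """Fit one Ridge model per alpha and return {alpha: test MSE} (illustrative helper)."""
    results = {}
    for alpha in alphas:
        model = Ridge(alpha=alpha)
        model.fit(x_train, y_train)
        results[alpha] = mean_squared_error(y_test, model.predict(x_test))
    return results

# Synthetic regression data standing in for the insurance features and charges
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(500, 8))
y_demo = x_demo @ rng.normal(size=8) + rng.normal(scale=0.5, size=500)

x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.2, random_state=1)

# Same alpha values as the four models above
results = sweep_alphas(x_tr, x_te, y_tr, y_te, [20, 0.1, 0.01, 0.0001])
for alpha, mse in results.items():
    print('alpha=%g: test MSE = %.4f' % (alpha, mse))
```

In practice, alpha is usually chosen by cross-validation on the training set (for example with scikit-learn's `RidgeCV`) rather than by comparing test-set MSE, so that the test set remains an unbiased check.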