Machine Learning: Simple Linear Regression 2-2
阿新 • Published: 2022-03-14
Predicting medical charges from multiple factors with multiple linear regression
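Main steps:
- 1. Import packages
- 2. Import the dataset
- 3. Data preprocessing
  - 3.1 Check for missing values
  - 3.2 Label encoding & one-hot encoding
  - 3.3 Get the independent and dependent variables
  - 3.4 Split into training and test sets
- 4. Build the multiple linear regression model
- 5. Get the model expression
- 6. Predict on the test set
- 7. Get the model MSE
- 8. Draw a violin plot of smoking vs. medical charges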
1. Import packages
In [1]:
# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
2. Import the dataset
In [2]:
# Import the dataset
data = pd.read_csv('insurance.csv')
data.head(5)
Out[2]:
| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
3. Data preprocessing
3.1 Check for missing values
In [3]:
# Check for missing values
null_df = data.isnull().sum()
null_df
Out[3]:
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64
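3.2 Label encoding & one-hot encoding
In [4]:
# Label encoding & one-hot encoding
# drop_first=True drops one dummy column per categorical variable to avoid redundant columns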
data = pd.get_dummies(data, drop_first = True)
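To make the effect of `drop_first = True` concrete, here is a minimal sketch on a three-row toy frame (the `toy` DataFrame is made up for illustration and is not the insurance data). The `dtype=int` argument is an addition not used in the original call; it is passed here because recent pandas versions may return boolean dummy columns by default.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# One dummy column per category, minus the first category of each variable
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded.columns.tolist())
print(encoded)
```

Each two-level variable collapses into a single 0/1 column (`sex_male`, `smoker_yes`); the dropped level (`female`, `no`) becomes the baseline against which the coefficient is read.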
3.3 Get the independent and dependent variables
In [5]:
# Get the independent variables (x) and the dependent variable (y)
y = data['charges'].values
data = data.drop(['charges'], axis = 1)
x = data.values
3.4 Split into training and test sets
In [6]:
# Split into training and test sets (80% / 20%)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
(1070, 8)
(268, 8)
(1070,)
(268,)
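The shapes above follow from `test_size = 0.2` applied to 1338 rows (1070 + 268). A small sketch on placeholder arrays of the same shape (the `x_demo` and `y_demo` arrays are made up, not the insurance data) shows the same split sizes and that a fixed `random_state` makes the split repeatable:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the encoded insurance data
x_demo = np.arange(1338 * 8).reshape(1338, 8)
y_demo = np.arange(1338)

a_train, a_test, b_train, b_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1)
# Repeating the call with the same random_state yields the same rows
c_train, c_test, d_train, d_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=1)

print(a_train.shape, a_test.shape)
same_split = np.array_equal(b_test, d_test)
print(same_split)
```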
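4. Build the multiple linear regression model
In [7]:
# Build the multiple linear regression model
# Note: the `normalize` parameter was deprecated in scikit-learn 1.0 and removed in 1.2.
# On newer versions, drop it; ordinary least squares predictions do not depend on it.
from sklearn.linear_model import LinearRegression
regressor = LinearRegression(normalize = True, fit_intercept = True)
regressor.fit(x_train, y_train)
Out[7]:
LinearRegression(normalize=True)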
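The `normalize = True` argument used above is no longer accepted by recent scikit-learn releases. A minimal sketch of one possible replacement, on synthetic data made up for illustration: put a `StandardScaler` in front of `LinearRegression` in a pipeline. This is not identical to what `normalize` did internally (it is a different rescaling), but for ordinary least squares the fitted predictions should match those of an unscaled model up to floating-point error, which the sketch checks.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
y = 3.0 * X[:, 0] - 0.5 * X[:, 1] + 0.02 * X[:, 2] + 7.0
y = y + rng.normal(scale=0.1, size=200)

# Plain OLS versus OLS on standardized features
plain = LinearRegression().fit(X, y)
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X, y)

# Largest absolute difference between the two sets of predictions
max_diff = np.max(np.abs(plain.predict(X) - scaled.predict(X)))
print(max_diff)
```

Scaling matters for regularized models such as Ridge or Lasso; for plain least squares it mainly changes how the coefficients are expressed, not the fit itself.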
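5. Get the model expression
In [8]:
# Get the model expression
print('The mathematical expression is:\n Charges = ', end='')
columns = data.columns
coefs = regressor.coef_
# Print one "name * coefficient" term per feature, then the intercept
for i in range(len(columns)):
    print('%s * %.2f + ' %(columns[i], coefs[i]), end='')
print(regressor.intercept_)
The mathematical expression is:
Charges = age * 257.49 + bmi * 321.62 + children * 408.06 + sex_male * -242.15 + smoker_yes * 23786.49 + region_northwest * -396.10 + region_southeast * -1038.38 + region_southwest * -903.03 + -11297.610008539417
The expression shows that smoker_yes has by far the largest coefficient (23786.49): with all other variables held fixed, being a smoker raises the predicted charges by about 23,786.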
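As a worked example of reading the expression, the sketch below plugs the first row of the dataset (age 19, female, BMI 27.9, no children, smoker, southwest) into the printed coefficients. The helper `predict_charge` and the dictionaries are hypothetical names introduced here; because the coefficients are rounded to two decimals, the result can differ slightly from `regressor.predict`.

```python
# Coefficients and intercept copied from the printed expression
coefs = {
    "age": 257.49,
    "bmi": 321.62,
    "children": 408.06,
    "sex_male": -242.15,
    "smoker_yes": 23786.49,
    "region_northwest": -396.10,
    "region_southeast": -1038.38,
    "region_southwest": -903.03,
}
intercept = -11297.610008539417

def predict_charge(features):
    """Evaluate the linear expression; missing features count as 0."""
    return intercept + sum(coefs[name] * features.get(name, 0) for name in coefs)

# First row of data.head(): female (sex_male = 0), smoker, southwest
row = {"age": 19, "bmi": 27.9, "children": 0,
       "sex_male": 0, "smoker_yes": 1, "region_southwest": 1}
smoker_pred = predict_charge(row)
nonsmoker_pred = predict_charge({**row, "smoker_yes": 0})

print(round(smoker_pred, 2))
print(round(smoker_pred - nonsmoker_pred, 2))
```

The first value comes to about 25451.36, against a recorded charge of 16884.92 for that row, and the difference between the two predictions is exactly the smoker_yes coefficient.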
6. Predict on the test set
In [9]:
# Predict on the test set
y_pred = regressor.predict(x_test)
In [10]:
compare_df = pd.DataFrame(y_test, columns=['truth'])
compare_df['pred'] = y_pred
compare_df.head(10)
Out[10]:
| | truth | pred |
|---|---|---|
| 0 | 1646.42970 | 4383.680900 |
| 1 | 11353.22760 | 12885.038922 |
| 2 | 8798.59300 | 12589.216532 |
| 3 | 10381.47870 | 13286.229192 |
| 4 | 2103.08000 | 544.728328 |
| 5 | 38746.35510 | 32117.584008 |
| 6 | 9304.70190 | 12919.042372 |
| 7 | 11658.11505 | 12318.621830 |
| 8 | 3070.80870 | 3784.291456 |
| 9 | 19539.24300 | 29468.457254 |
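7. Get the model MSE
In [11]:
# Get the model's MSE on the test set
from sklearn.metrics import mean_squared_error
mse_score = mean_squared_error(y_test, y_pred)
print('MSE of the multiple linear regression model: %.2f' %(mse_score))
MSE of the multiple linear regression model: 35479352.81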
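The MSE is expressed in squared charge units, which makes it hard to read directly. A minimal sketch, using only NumPy, of how the MSE is computed and how its square root (RMSE) brings the error back to the units of `charges`. The arrays are the ten truth/pred rows shown earlier, so `mse_10` covers those ten rows only and is not the test-set MSE; the function name `mse` is introduced here for illustration.

```python
import numpy as np

# The ten rows displayed in compare_df.head(10)
truth = np.array([1646.42970, 11353.22760, 8798.59300, 10381.47870,
                  2103.08000, 38746.35510, 9304.70190, 11658.11505,
                  3070.80870, 19539.24300])
pred = np.array([4383.680900, 12885.038922, 12589.216532, 13286.229192,
                 544.728328, 32117.584008, 12919.042372, 12318.621830,
                 3784.291456, 29468.457254])

def mse(y_true, y_pred):
    """Mean of the squared differences between truth and prediction."""
    return float(np.mean((y_true - y_pred) ** 2))

mse_10 = mse(truth, pred)
rmse_10 = float(np.sqrt(mse_10))

# RMSE corresponding to the test-set MSE reported above
rmse_full = float(np.sqrt(35479352.81))
print(mse_10, rmse_10, rmse_full)
```

For the reported test-set MSE, the RMSE is about 5956, meaning predictions are typically off by several thousand in the units of `charges`.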
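8. Draw a violin plot of smoking vs. medical charges
In [12]:
# Draw a violin plot of smoking vs. medical charges, with the individual points overlaid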
data['charges'] = y
sns.violinplot(x='smoker_yes', y='charges', data=data)
sns.stripplot(x='smoker_yes', y='charges', jitter=True, color='red', data=data)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x1f84b105108>
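Conclusions
1) In the violin plot above, charges for non-smokers (left violin) are concentrated at the low end, with comparatively few points spread out above the median.
2) The violin for smokers (right) is roughly symmetric and more evenly spread, and even its minimum is around the median charge of non-smokers.
3) Comparing the two violins shows that smokers' average medical charges are far higher than non-smokers'.
4) This is consistent with the regression expression, in which smoker_yes has the largest coefficient: whether a person smokes strongly affects medical charges.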
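The visual comparison can also be checked numerically by grouping charges by smoking status. The sketch below only illustrates the call pattern on the five rows shown in `data.head(5)`, which are far too few to support any conclusion; on the full dataset the same `groupby` could be run on the original `smoker` column (or on `smoker_yes` after encoding).

```python
import pandas as pd

# The five rows displayed by data.head(5); illustrative sample only
sample = pd.DataFrame({
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# Mean, median and minimum charges for each group
summary = sample.groupby("smoker")["charges"].agg(["mean", "median", "min"])
print(summary)
```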