Machine Learning: Simple Linear Regression 2-3

阿新 • Published: 2022-03-14

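Predicting medical charges from age with polynomial regression

Main steps:

1. Import packages
2. Import the dataset
3. Preprocess the data
- 3.1 Check for missing values
- 3.2 Filter the data
- 3.3 Get the dependent variable
- 3.4 Create the independent variables
- 3.5 Check the correlation between the new independent variables and charges
- 3.6 Split into training and test sets
4. Build the polynomial regression model
- 4.1 Build the model
- 4.2 Get the linear expression
- 4.3 Predict on the test set
- 4.4 Get the model's MSE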
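5. Build a simple linear regression model (for comparison)
- 5.1 Build the simple linear regression model (for comparison)
- 5.2 Predict on the test set
- 5.3 Get the model's MSE
6. Compare the two models visually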

Dataset link:

https://www.cnblogs.com/ojbtospark/p/16005626.html

1. Import packages

In [2]:

# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

2. Import the dataset

In [3]:

# Import the dataset
data = pd.read_csv('insurance.csv')
data.head()

Out[3]:

	age	sex	bmi	children	smoker	region	charges
0	19	female	27.900	0	yes	southwest	16884.92400
1	18	male	33.770	1	no	southeast	1725.55230
2	28	male	33.000	3	no	southeast	4449.46200
3	33	male	22.705	0	no	northwest	21984.47061
4	32	male	28.880	0	no	northwest	3866.85520

3. Preprocess the data

3.1 Check for missing values

In [4]:

# Count the missing values in each column

null_df = data.isnull().sum()
null_df

Out[4]:

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

3.2 Filter the data

In [5]:

# Scatter plot of age vs. charges
plt.figure()
plt.scatter(data['age'], data['charges'])
plt.title('Charges vs Age (Original Dataset)')
plt.show()

In [6]:

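# Filter the data: keep only rows at or below a charges limit that depends on the age band
new_data_1 = data.query('age<=40 & charges<=10000') # age 40 or younger, charges at most 10000
new_data_2 = data.query('age>40 & age<=50 & charges<=12500') # older than 40 and up to 50, charges at most 12500
new_data_3 = data.query('age>50 & charges<=17000') # older than 50, charges at most 17000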
new_data = pd.concat([new_data_1, new_data_2, new_data_3], axis=0)
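
The three queries above amount to a single rule: keep a row when its charges do not exceed a limit that depends on the age band. The following is a minimal NumPy sketch of that rule. The arrays `age` and `charges` hold made-up values rather than rows from insurance.csv, and the names `cap` and `keep` are hypothetical.

```python
import numpy as np

# Made-up sample values for illustration only (not taken from insurance.csv)
age = np.array([19, 30, 40, 41, 50, 51, 64])
charges = np.array([16884.0, 4000.0, 10000.0, 12000.0, 13000.0, 16000.0, 30000.0])

# Age-dependent upper limit on charges, matching the three query conditions:
# age <= 40 -> 10000, 40 < age <= 50 -> 12500, age > 50 -> 17000
cap = np.where(age <= 40, 10000, np.where(age <= 50, 12500, 17000))

# Keep a row only when its charges are at or below the limit for its age band
keep = charges <= cap
print(age[keep])
```

A single mask keeps the surviving rows in their original order, whereas concatenating the three subsets groups them by age band. The set of rows is the same either way.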

In [7]:

# Scatter plot of age vs. charges after filtering
plt.figure()
plt.scatter(new_data['age'], new_data['charges'])
plt.title('Charges vs Age (Filtered Dataset)')
plt.show()

In [8]:

# Check the correlation between age and charges
print('Correlation between age and charges:\n', np.corrcoef(new_data['age'], new_data['charges']))

Correlation between age and charges:
 [[1.         0.97552029]
 [0.97552029 1.        ]]

3.3 Get the dependent variable

In [9]:

# Get the dependent variable
y = new_data['charges'].values

3.4 Create the independent variables

In [10]:

# Create the independent variables: age, age^2, age^3 and age^4
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4, include_bias=False)
x_poly = poly_reg.fit_transform(new_data.iloc[:, 0:1].values)
x_poly

Out[10]:

array([[1.8000000e+01, 3.2400000e+02, 5.8320000e+03, 1.0497600e+05],
       [2.8000000e+01, 7.8400000e+02, 2.1952000e+04, 6.1465600e+05],
       [3.2000000e+01, 1.0240000e+03, 3.2768000e+04, 1.0485760e+06],
       ...,
       [5.2000000e+01, 2.7040000e+03, 1.4060800e+05, 7.3116160e+06],
       [5.7000000e+01, 3.2490000e+03, 1.8519300e+05, 1.0556001e+07],
       [5.2000000e+01, 2.7040000e+03, 1.4060800e+05, 7.3116160e+06]])
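
For a single input column, `PolynomialFeatures(degree = 4, include_bias=False)` returns that column raised to the powers 1 through 4. As a rough illustration, the sketch below rebuilds the first three rows of Out[10] with plain NumPy, using the first three ages listed in Out[11]. The names `powers` and `x_poly_manual` are hypothetical.

```python
import numpy as np

# First three ages of the filtered data, as listed in Out[11]
age = np.array([[18.0], [28.0], [32.0]])

# Exponents 1..4; there is no constant column, as with include_bias=False
powers = np.arange(1, 5)

# Broadcasting: shape (3, 1) ** shape (4,) gives shape (3, 4)
x_poly_manual = age ** powers
print(x_poly_manual)
```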

In [11]:

# Show the age column
new_data.iloc[:, 0:1]

Out[11]:

	age
1	18
2	28
4	32
5	31
7	37
...	...
1325	61
1327	51
1329	52
1330	57
1332	52

966 rows × 1 columns

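3.5 Check the correlation between the new independent variables and charges

In [12]:

# Check the correlation between the new independent variables and charges
corr_df = pd.DataFrame(x_poly, columns=['one','two','three','four'])
corr_df['charges'] = y
print('Correlation between the powers of age and charges:\n', corr_df.corr(method='pearson'))

Correlation between the powers of age and charges: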
               one       two     three      four   charges
one      1.000000  0.988503  0.960359  0.924344  0.975520
two      0.988503  1.000000  0.991262  0.970392  0.977944
three    0.960359  0.991262  1.000000  0.993638  0.961974
four     0.924344  0.970392  0.993638  1.000000  0.935838
charges  0.975520  0.977944  0.961974  0.935838  1.000000
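
The off-diagonal values near 1 show that the powers of age are strongly correlated with one another, not only with charges. That is expected when the input takes only positive values over a limited range. The sketch below computes the Pearson coefficient by hand and checks it against `np.corrcoef`. It assumes hypothetical ages, one per year from 18 to 64, rather than the filtered dataset, and `pearson` is an illustrative helper.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

# Hypothetical ages, one per year from 18 to 64 (not the filtered dataset)
age = np.arange(18, 65, dtype=float)

r_manual = pearson(age, age ** 2)
r_numpy = np.corrcoef(age, age ** 2)[0, 1]
print(r_manual, r_numpy)
```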

3.6 Split into training and test sets

In [13]:

# Split into training and test sets (80% / 20%)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_poly, y, test_size = 0.2, random_state = 1)
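
Conceptually, the split shuffles the row indices with a fixed seed and sets aside 20% of them for testing. The sketch below illustrates this with NumPy only. It assumes the 966 filtered rows reported in Out[11] and, to the best of my understanding, mirrors scikit-learn's behavior of rounding the test share up. The random generator used here is not the same stream as `random_state = 1`, so the selected rows will differ from the ones in the post.

```python
import math
import numpy as np

n_samples = 966   # number of rows after filtering (see Out[11])
test_size = 0.2

# Shuffle the row indices reproducibly; the seed is for illustration only
rng = np.random.default_rng(1)
indices = rng.permutation(n_samples)

# Round the test share up and give the remaining rows to the training set
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
test_idx, train_idx = indices[:n_test], indices[n_test:]
print(n_train, n_test)
```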

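4. Build the polynomial regression model

4.1 Build the model

In [14]:

# Build the polynomial regression model
# Note: the normalize argument was deprecated in scikit-learn 1.0 and removed in 1.2;
# on newer versions, drop it (fit_intercept=True is already the default)
from sklearn.linear_model import LinearRegression
regressor_pr = LinearRegression(normalize = True, fit_intercept = True)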
regressor_pr.fit(x_train, y_train)

Out[14]:

LinearRegression(normalize=True)
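
Polynomial regression is ordinary least squares applied to the expanded columns, so the model stays linear in its coefficients. As a rough sketch of what the fit computes, the example below solves the same kind of problem with `np.linalg.lstsq`. It uses synthetic data from a known quadratic and degree 2 for brevity; the data and the names `X` and `beta` are assumptions for illustration, not the post's dataset or scikit-learn's internal implementation.

```python
import numpy as np

# Synthetic data from a known quadratic (hypothetical, for illustration only)
age = np.arange(18, 65, dtype=float)
charges = 3.0 + 2.0 * age + 0.5 * age ** 2

# Design matrix: a column of ones for the intercept, then age and age^2
X = np.column_stack([np.ones_like(age), age, age ** 2])

# Least-squares solution of X @ beta ~= charges
beta, *_ = np.linalg.lstsq(X, charges, rcond=None)
intercept, coef = beta[0], beta[1:]
print(intercept, coef)
```

Because the data are noise-free, the recovered intercept and coefficients match the values used to generate them up to floating-point error.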

4.2 Get the linear expression

In [15]:

# Print the fitted model as a linear expression in the powers of age
print('Charges = %.2f * Age + %.2f * Age^2 + %.2f * Age^3 + %.2f * Age^4 + %.2f' 
      %(regressor_pr.coef_[0], regressor_pr.coef_[1], regressor_pr.coef_[2], regressor_pr.coef_[3], regressor_pr.intercept_))
# Charges = -300.10 * Age + 19.35 * Age^2 + -0.31 * Age^3 + 0.00 * Age^4 + 2687.10

Charges = -300.10 * Age + 19.35 * Age^2 + -0.31 * Age^3 + 0.00 * Age^4 + 2687.10
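
The Age^4 coefficient prints as 0.00 only because `%.2f` rounds it to two decimals. It still matters, since Age^4 reaches about ten million at age 57 (see Out[10]). The sketch below shows how scientific notation would keep the significant digits. The value of `coef_age4` is made up for illustration; the post does not show the fitted value.

```python
# Hypothetical small coefficient (NOT the fitted value, which the post does not show)
coef_age4 = 0.0017

fixed = '%.2f' % coef_age4        # two decimals hide the value
scientific = '%.2e' % coef_age4   # scientific notation keeps the significant digits

# Contribution of this term at age 57, where Age^4 = 10,556,001 (see Out[10])
contribution = coef_age4 * 57 ** 4
print(fixed, scientific, round(contribution, 2))
```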

4.3 Predict on the test set

In [16]:

# Predict on the test set
y_pred_pr = regressor_pr.predict(x_test)

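4.4 Get the model's MSE

In [17]:

# Compute the mean squared error (MSE) on the test set
from sklearn.metrics import mean_squared_error
mse_score = mean_squared_error(y_test, y_pred_pr)
print('MSE of the polynomial regression model: %.2f' %(mse_score)) # 654,495.38

MSE of the polynomial regression model: 654495.38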
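
MSE is the mean of the squared differences between the actual and the predicted values. The following minimal sketch writes that definition out in NumPy. The helper `mse` and the demo values are made up for illustration and are unrelated to the dataset.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Made-up values for illustration only
y_true_demo = [1000.0, 2000.0, 3000.0]
y_pred_demo = [1100.0, 1900.0, 3300.0]
print(mse(y_true_demo, y_pred_demo))
```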

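5. Build a simple linear regression model (for comparison)

5.1 Build the simple linear regression model (for comparison)

In [18]:

# Build the simple linear regression model (for comparison)
# Only the first column of the polynomial features (age itself) is used
# Note: normalize was removed in scikit-learn 1.2; drop it on newer versions
regressor_slr = LinearRegression(normalize = True, fit_intercept = True)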
regressor_slr.fit(x_train[:,0:1], y_train)

Out[18]:

LinearRegression(normalize=True)

5.2 Predict on the test set

In [19]:

# Predict on the test set using the age column only
y_pred_slr = regressor_slr.predict(x_test[:,0:1])

5.3 Get the model's MSE

In [20]:

# Compute the MSE on the test set
mse_score = mean_squared_error(y_test, y_pred_slr)
print('MSE of the simple linear regression model: %.2f' %(mse_score))

MSE of the simple linear regression model: 738002.02
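
The two MSE values are easier to compare after taking the square root, because RMSE is in the same unit as charges. The sketch below uses only the two MSE figures reported in the outputs of this post. The variable names are illustrative.

```python
import math

# MSE values reported in the outputs of this post
mse_poly = 654495.38
mse_linear = 738002.02

# RMSE is in the same unit as charges, which makes the error easier to read
rmse_poly = math.sqrt(mse_poly)
rmse_linear = math.sqrt(mse_linear)

# Relative reduction in MSE when moving from the linear to the polynomial model
reduction = (mse_linear - mse_poly) / mse_linear
print('RMSE: %.2f vs %.2f, MSE reduction: %.1f%%' % (rmse_poly, rmse_linear, 100 * reduction))
```

On this test split the polynomial model's typical error is roughly 809 against roughly 859 for the straight line, an MSE reduction of about 11%.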

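6. Compare the two models visually

In [21]:

# Visualize the predictions on the test set
order = np.argsort(x_test[:,0]) # sort by age so each model is drawn as one continuous line
plt.scatter(x_test[:,0], y_test, color = 'green', alpha=0.5)
plt.plot(x_test[order,0], y_pred_slr[order], color = 'blue', label = 'Simple linear')
plt.plot(x_test[order,0], y_pred_pr[order], color = 'red', label = 'Polynomial (degree 4)')
plt.legend()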
plt.title('Charges vs Age (Test set)')
plt.xlabel('Age')
plt.ylabel('Charges')
plt.show()