
Learn: Model Validation


Mean Absolute Error (MAE)

There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE). In general terms, the error is:
error = actual - predicted

MAE (mean absolute error): the average of |actual - predicted|.

MSE (mean squared error): the average of (actual - predicted)^2.
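
For concreteness, here is a minimal sketch of both metrics in plain Python (the numbers and variable names are illustrative, not from the lesson):

actual    = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5,  0.0, 2.0, 8.0]

errors = [a - p for a, p in zip(actual, predicted)]   # error = actual - predicted
mae = sum(abs(e) for e in errors) / len(errors)       # mean of |error|   -> 0.5
mse = sum(e ** 2 for e in errors) / len(errors)       # mean of error**2  -> 0.375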

Model

# Model
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

melbourne_data = pd.read_csv(r'G:\kaggle\melb_data.csv')
filtered_melbourne_data = melbourne_data.dropna(axis=0)   # drop rows with missing values

y = filtered_melbourne_data.Price                         # prediction target
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea',
                      'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

melbourne_model = DecisionTreeRegressor(random_state=1)   # define the model
melbourne_model.fit(X, y)                                 # fit it on all of the data
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')
filtered_melbourne_data[['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']].head()
# Note the double square brackets: the DataFrame is indexed with a single argument, a list of column names (see the short example after the output below)
Rooms Bathroom Landsize BuildingArea YearBuilt Lattitude Longtitude
1 2 1.0 156.0 79.0 1900.0 -37.8079 144.9934
2 3 2.0 134.0 150.0 1900.0 -37.8093 144.9944
4 4 1.0 120.0 142.0 2014.0 -37.8072 144.9941
6 3 2.0 245.0 210.0 1910.0 -37.8024 144.9993
7 2 1.0 256.0 107.0 1890.0 -37.8060 144.9954
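
As a short aside on the bracket syntax noted above (a quick illustration using two of the same columns):

filtered_melbourne_data['Rooms'].head()                 # single brackets, one string -> returns a Series
filtered_melbourne_data[['Rooms', 'Bathroom']].head()   # double brackets, a list     -> returns a DataFrame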
help(DecisionTreeRegressor)

DecisionTreeRegressor() parameters:
criterion defaults to 'mse' (mean squared error)
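
Note that criterion only controls how candidate splits are scored while the tree is being grown; it is separate from the metric we use below to evaluate predictions. A minimal sketch, assuming the older scikit-learn API shown in the output above (newer releases renamed these values to 'squared_error' and 'absolute_error'):

from sklearn.tree import DecisionTreeRegressor

model_mse = DecisionTreeRegressor(criterion='mse', random_state=1)  # default: squared-error splits
model_mae = DecisionTreeRegressor(criterion='mae', random_state=1)  # absolute-error splits (slower to train)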

Computing the MAE (mean absolute error)

First, the training error, measured on the training data itself:

from sklearn.metrics import mean_absolute_error

predicted_home_price = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_price)   # returns the MAE (a loss value)
434.71594577146544

The Problem with "In-Sample" Scores

We used all of the data to train the model, and then computed the error on that very same training data.
But the point of a model is to make predictions on new data. The error may be small here because the model fits this data especially well, yet on data it has never seen, performance could be much worse. The solution: hold out validation data.

Solving the "In-Sample" Scores Problem

The scikit-learn library has a function train_test_split to break the data into two pieces:
one piece is used as training data to fit the model,
and the other is used as validation data to calculate mean_absolute_error.

Splitting the data into training and validation sets (train_test_split())

from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.33, random_state=0)   # default test_size is 0.25
help(train_test_split)
len(train_X),len(train_y)
(4151, 4151)
len(val_X),len(val_y)
(2045, 2045)
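
These sizes match the arithmetic of test_size=0.33: after dropna the data has 4151 + 2045 = 6196 rows, and scikit-learn rounds the test fraction up (a quick check, not part of the original lesson):

import math

n_total = len(filtered_melbourne_data)   # 6196 rows after dropna
n_val   = math.ceil(n_total * 0.33)      # ceil(2044.68) = 2045 validation rows
n_train = n_total - n_val                # 4151 training rows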

Train the model on the training set

# melbourne_model = DecisionTreeRegressor()   # define the model
melbourne_model.fit(train_X, train_y)          # refit the existing model on the training split only
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
           max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           presort=False, random_state=1, splitter='best')

Evaluate the model on the validation set

val_prediction_y = melbourne_model.predict(val_X)
mean_absolute_error(val_y, val_prediction_y)   # MAE on data the model never saw during training
254577.2400977995
val_y.mean()
1088835.6136919316
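
To put the validation MAE in context, compare it to the average sale price (a quick back-of-the-envelope check, not part of the original lesson):

mean_absolute_error(val_y, val_prediction_y) / val_y.mean()   # ≈ 254577 / 1088836 ≈ 0.23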

An error of roughly a quarter of the average home value is large, so the model needs improvement.