Learn: Model Validation
阿新 • Published: 2019-01-13
Mean Absolute Error (MAE)
There are many metrics for summarizing model quality, but we’ll start with one called Mean Absolute Error (also called MAE). Broadly speaking, the error of a prediction is:
error = actual - predicted
MAE (mean absolute error): the mean of the absolute values of the errors.
MSE (mean squared error): the mean of the squared errors.
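As a minimal sketch of the two definitions, the snippet below computes MAE and MSE by hand. The prices are made-up illustrative values, not taken from the Melbourne dataset.

```python
# Hypothetical actual and predicted prices (illustrative values only).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 330.0, 400.0]

# error = actual - predicted, computed per example.
errors = [a - p for a, p in zip(actual, predicted)]

# MAE: mean of the absolute errors.
mae = sum(abs(e) for e in errors) / len(errors)

# MSE: mean of the squared errors.
mse = sum(e ** 2 for e in errors) / len(errors)

print(mae)  # 12.5
print(mse)  # 275.0
```

Because the errors are squared, the single error of 30 dominates the MSE, while the MAE weights every error in proportion to its size.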
Model
# Model
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
# Load the data and drop every row that has a missing value
melbourne_data = pd.read_csv(r'G:\kaggle\melb_data.csv')
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Prediction target and features (column names are spelled as in the dataset)
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]
# Define the model and fit it on all of the data
melbourne_model = DecisionTreeRegressor(random_state=1)
melbourne_model.fit(X, y)
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=1, splitter='best')
filtered_melbourne_data[['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 'YearBuilt', 'Lattitude', 'Longtitude']].head()
# Note the double brackets: the indexer takes a single argument, here a list of column names
| | Rooms | Bathroom | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude |
|---|---|---|---|---|---|---|---|
| 1 | 2 | 1.0 | 156.0 | 79.0 | 1900.0 | -37.8079 | 144.9934 |
| 2 | 3 | 2.0 | 134.0 | 150.0 | 1900.0 | -37.8093 | 144.9944 |
| 4 | 4 | 1.0 | 120.0 | 142.0 | 2014.0 | -37.8072 | 144.9941 |
| 6 | 3 | 2.0 | 245.0 | 210.0 | 1910.0 | -37.8024 | 144.9993 |
| 7 | 2 | 1.0 | 256.0 | 107.0 | 1890.0 | -37.8060 | 144.9954 |
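The double-bracket note can be checked on a tiny frame. This is a sketch with made-up rows; only the column names mirror the Melbourne data.

```python
import pandas as pd

# Tiny hypothetical frame; the values are illustrative only.
df = pd.DataFrame({"Rooms": [2, 3], "Bathroom": [1.0, 2.0], "Price": [1.0e6, 1.5e6]})

one_column = df["Rooms"]                 # a single label returns a Series
two_columns = df[["Rooms", "Bathroom"]]  # a list of labels returns a DataFrame

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
print(list(two_columns.columns))   # ['Rooms', 'Bathroom']
```

The outer brackets are the indexing operation and the inner brackets build the list, which is why several columns need two pairs.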
help(DecisionTreeRegressor)
DecisionTreeRegressor():
Parameters:
criterion: defaults to 'mse' (mean squared error) in the scikit-learn version used here
Computing MAE (mean absolute error)
Training error, measured on the training data:
from sklearn.metrics import mean_absolute_error
predicted_home_price = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_price)  # returns the MAE (the loss)
434.71594577146544
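The Problem with “In-Sample” Scores
We used all of the data to train the model, and then computed the error on that same training data.
But the purpose of a model is to make predictions on new data. The error may be small on the data the model was fit to, where the fit looks excellent, yet the model may perform poorly on data it has not seen. The solution is to hold out validation data.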
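The gap between in-sample and out-of-sample error can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions: the data is randomly generated (a linear signal plus noise), and the 400/100 row split is arbitrary.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(500, 1))
y = 3.0 * X[:, 0] + rng.normal(0.0, 2.0, size=500)

# Fit on the first 400 rows and hold out the last 100 as unseen data.
seen_X, seen_y = X[:400], y[:400]
new_X, new_y = X[400:], y[400:]

# An unconstrained tree can memorize the rows it was fit on.
model = DecisionTreeRegressor(random_state=1)
model.fit(seen_X, seen_y)

in_sample_mae = mean_absolute_error(seen_y, model.predict(seen_X))
new_data_mae = mean_absolute_error(new_y, model.predict(new_X))

print(in_sample_mae)  # essentially zero
print(new_data_mae)   # noticeably larger
```

The in-sample score says the model is nearly perfect, while the held-out rows show how much of that fit was memorized noise.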
Solving the “In-Sample” Scores Problem
The scikit-learn library has a function, train_test_split, that breaks the data into two pieces.
We'll use one piece as training data to fit the model,
and the other as validation data to calculate mean_absolute_error.
Splitting the dataset into training and validation sets (train_test_split())
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.33, random_state=0)  # test_size defaults to 0.25
help(train_test_split)
len(train_X),len(train_y)
(4151, 4151)
len(val_X),len(val_y)
(2045, 2045)
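The two lengths are consistent with test_size=0.33. The check below is a sketch that assumes train_test_split rounds the validation count up and gives the remaining rows to training, which matches the numbers printed above.

```python
import math

n_rows = 4151 + 2045  # total rows, from the two lengths printed above
test_size = 0.33

n_val = math.ceil(test_size * n_rows)  # assumed: validation count is rounded up
n_train = n_rows - n_val               # the remaining rows are used for training

print(n_rows, n_train, n_val)  # 6196 4151 2045
```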
Training the model on the training set
# melbourne_model = DecisionTreeRegressor()  # define the model (already defined above, so it is reused here)
melbourne_model.fit(train_X, train_y)
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=1, splitter='best')
Evaluating the model on the validation set
val_prediction_y= melbourne_model.predict(val_X)
mean_absolute_error(val_y, val_prediction_y)
254577.2400977995
val_y.mean()
1088835.6136919316
The validation MAE is far larger than the in-sample MAE of about 435, so the model needs improvement.
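To put the validation error on a relative scale, it can be compared with the mean validation price. This is a small arithmetic sketch that uses only the two numbers printed above.

```python
validation_mae = 254577.2400977995  # MAE on the validation set, from above
mean_price = 1088835.6136919316     # val_y.mean(), from above

# Average error as a fraction of the average house price.
relative_error = validation_mae / mean_price

print(round(relative_error, 3))  # 0.234
```

On average the predictions are off by roughly a quarter of a typical house price.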