XGBoost: label size and predict size do not match
XGBoostError: b'[19:12:58] src/metric/rank_metric.cc:89: Check failed: (preds.size()) == (info.labels.size()) label size predict size not match'
I am training an XGBClassifier on my training set.
My training features are a numpy array of shape (45001, 10338), and my training labels are a numpy array of shape (45001,). I have 1161 unique labels, so I label-encoded them.
The documentation clearly says that I can create a DMatrix from a numpy array, so I am passing the training features and labels above directly as numpy arrays. But I am getting the following error:
---------------------------------------------------------------------------
XGBoostError Traceback (most recent call last)
<ipython-input-30-3de36245534e> in <module>()
13 scale_pos_weight=1,
14 seed=27)
---> 15 modelfit(xgb1, train_x, train_y)
<ipython-input-27-9d215eac135e> in modelfit(alg, train_data_features, train_labels, useTrainCV, cv_folds, early_stopping_rounds)
6 xgtrain = xgb.DMatrix(train_data_features, label=train_labels)
7 cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
----> 8 metrics='auc',early_stopping_rounds=early_stopping_rounds)
9 alg.set_params(n_estimators=cvresult.shape[0])
10
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in cv(params, dtrain, num_boost_round, nfold, stratified, folds, metrics, obj, feval, maximize, early_stopping_rounds, fpreproc, as_pandas, verbose_eval, show_stdv, seed, callbacks)
399 for fold in cvfolds:
400 fold.update(i, obj)
--> 401 res = aggcv([f.eval(i, feval) for f in cvfolds])
402
403 for key, mean, std in res:
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in <listcomp>(.0)
399 for fold in cvfolds:
400 fold.update(i, obj)
--> 401 res = aggcv([f.eval(i, feval) for f in cvfolds])
402
403 for key, mean, std in res:
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in eval(self, iteration, feval)
221 def eval(self, iteration, feval):
222 """"Evaluate the CVPack for one iteration."""
--> 223 return self.bst.eval_set(self.watchlist, iteration, feval)
224
225
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in eval_set(self, evals, iteration, feval)
865 _check_call(_LIB.XGBoosterEvalOneIter(self.handle, iteration,
866 dmats, evnames, len(evals),
--> 867 ctypes.byref(msg)))
868 return msg.value
869 else:
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in _check_call(ret)
125 """
126 if ret != 0:
--> 127 raise XGBoostError(_LIB.XGBGetLastError())
128
129
XGBoostError: b'[19:12:58] src/metric/rank_metric.cc:89: Check failed: (preds.size()) == (info.labels.size()) label size predict size not match'
My model code is below:
def modelfit(alg, train_data_features, train_labels,useTrainCV=True, cv_folds=5, early_stopping_rounds=50):
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgb_param['num_class'] = 1161
        xgtrain = xgb.DMatrix(train_data_features, label=train_labels)
        cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
            metrics='auc',early_stopping_rounds=early_stopping_rounds)
        alg.set_params(n_estimators=cvresult.shape[0])

    #Fit the algorithm on the data
    alg.fit(train_data_features, train_labels, eval_metric='auc')

    #Predict training set:
    dtrain_predictions = alg.predict(train_data_features)
    dtrain_predprob = alg.predict_proba(train_data_features)[:,1]

    #Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % metrics.accuracy_score(train_labels, dtrain_predictions))
Where am I going wrong in the code above?
My classifier is as follows:
xgb1 = xgb.XGBClassifier(
    learning_rate=0.1,
    n_estimators=50,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='multi:softmax',
    nthread=4,
    scale_pos_weight=1,
    seed=27)
EDIT 2: After changing the evaluation metric, I get the following error:
---------------------------------------------------------------------------
XGBoostError Traceback (most recent call last)
<ipython-input-9-30c62a886c2e> in <module>()
13 scale_pos_weight=1,
14 seed=27)
---> 15 modelfit(xgb1, train_x_trail, train_y_trail)
<ipython-input-8-9d215eac135e> in modelfit(alg, train_data_features, train_labels, useTrainCV, cv_folds, early_stopping_rounds)
6 xgtrain = xgb.DMatrix(train_data_features, label=train_labels)
7 cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=alg.get_params()['n_estimators'], nfold=cv_folds,
----> 8 metrics='auc',early_stopping_rounds=early_stopping_rounds)
9 alg.set_params(n_estimators=cvresult.shape[0])
10
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in cv(params, dtrain, num_boost_round, nfold, stratified, folds, metrics, obj, feval, maximize, early_stopping_rounds, fpreproc, as_pandas, verbose_eval, show_stdv, seed, callbacks)
398 evaluation_result_list=None))
399 for fold in cvfolds:
--> 400 fold.update(i, obj)
401 res = aggcv([f.eval(i, feval) for f in cvfolds])
402
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/training.py in update(self, iteration, fobj)
217 def update(self, iteration, fobj):
218 """"Update the boosters for one iteration"""
--> 219 self.bst.update(self.dtrain, iteration, fobj)
220
221 def eval(self, iteration, feval):
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in update(self, dtrain, iteration, fobj)
804
805 if fobj is None:
--> 806 _check_call(_LIB.XGBoosterUpdateOneIter(self.handle, iteration, dtrain.handle))
807 else:
808 pred = self.predict(dtrain)
/home/carnd/anaconda3/envs/dl/lib/python3.5/site-packages/xgboost/core.py in _check_call(ret)
125 """
126 if ret != 0:
--> 127 raise XGBoostError(_LIB.XGBGetLastError())
128
129
XGBoostError: b'[03:43:03] src/objective/multiclass_obj.cc:42: Check failed: (info.labels.size()) != (0) label set cannot be empty'
asked Jul 23 '17 at 4:56
====================================================================================================================================================================================================================================================================================================================================================================
2 Answers
+50
The original error occurs because the 'auc' metric was not designed for multi-class classification (see here).
You could use the scikit-learn wrapper of xgboost to overcome this issue. I modified your code to use this wrapper and produce a similar function. I am not sure why you are doing a grid search, though, as you are not enumerating over parameters; instead, you are using the parameters you specified in xgb1. Here is the modified code:
import numpy as np
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

def modelfit(alg, train_data_features, train_labels, useTrainCV=True, cv_folds=5):
    if useTrainCV:
        # Wrap each parameter value in a list so GridSearchCV sees a one-point grid
        params = alg.get_xgb_params()
        xgb_param = dict([(key, [params[key]]) for key in params])
        boost = xgb.sklearn.XGBClassifier()
        cvresult = GridSearchCV(boost, xgb_param, cv=cv_folds)
        # Cross-validate on the data passed to the function
        cvresult.fit(train_data_features, train_labels)
        alg = cvresult.best_estimator_

    # Fit the algorithm on the data
    alg.fit(train_data_features, train_labels)

    # Predict training set:
    dtrain_predictions = alg.predict(train_data_features)
    dtrain_predprob = alg.predict_proba(train_data_features)[:, 1]

    # Print model report:
    print("\nModel Report")
    print("Accuracy : %.4g" % accuracy_score(train_labels, dtrain_predictions))

xgb1 = xgb.sklearn.XGBClassifier(
    learning_rate=0.1,
    n_estimators=50,
    max_depth=5,
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='multi:softmax',
    nthread=4,
    scale_pos_weight=1,
    seed=27)

# Small synthetic data set: 200 rows, 30 features, 5 classes
X = np.random.normal(size=(200, 30))
y = np.random.randint(0, 5, 200)
modelfit(xgb1, X, y)
The output that I get is
Model Report
Accuracy : 1
Note that I used a much smaller data set. With the size that you mentioned, the algorithm may be very slow.
answered Aug 4 '17 at 15:41
-
In TensorFlow, we create batches and run them. Can I run this algorithm batch-wise, say 100 records at a time? How can I save this model and train it again? I will accept your answer – Kathiravan Natarajan Aug 5 '17 at 0:21
-
When you train neural network on tensorflow you use batch gradient descent. Thus you can do that in chunks. However, xgboost operates differently, so you cannot just separate it into chunks. However, I looked at xgboost faq page: xgboost.readthedocs.io/en/latest/faq.html, and in the section about large data sets they write this: XGBoost is designed to be memory efficient. Usually it can handle problems as long as the data fit into your memory (This usually means millions of instances). If you are running out of memory, checkout external memory version or distributed version of xgboost – Miriam Farber Aug 5 '17 at 8:45
-
Based on the above quote, it seems you can try running the code on your computer as it is. You can also set verbose=2 in GridSearchCV so that it prints more details while running. If that doesn't work, you could try the distributed version; they link to it from the FAQ page (the one I linked in the previous comment). You could also set useTrainCV=False: as you have only one set of parameters, you don't really need the grid search, so you can skip that part of your code (which is currently the heaviest part). – Miriam Farber Aug 5 '17 at 9:00
====================================================================================================================================================================================================================================================================================================================================================================
The error occurs because you are trying to use the AUC evaluation metric for multiclass classification, but AUC is only applicable to two-class problems. In the xgboost implementation, "auc" expects the prediction size to equal the label size, whereas your multiclass prediction size would be 45001*1161. Use either the "mlogloss" or the "merror" multiclass metric instead.
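To make the size mismatch concrete, here is a small pure-Python sketch on toy data (4 rows, 3 classes). The check_sizes helper is hypothetical and only mimics the kind of check the error message describes; it is not xgboost's code. The merror and mlogloss functions follow the usual textbook definitions (fraction of misclassified rows, and mean negative log-probability of the true class), so xgboost's own implementation may differ in details such as clipping.

```python
import math

def check_sizes(n_preds, n_labels):
    """Hypothetical mimic of a metric that needs one prediction per label."""
    return n_preds == n_labels

def merror(probs, labels):
    """Multiclass error rate: fraction of rows whose arg-max class is wrong."""
    wrong = 0
    for row, label in zip(probs, labels):
        predicted = max(range(len(row)), key=row.__getitem__)
        wrong += predicted != label
    return wrong / len(labels)

def mlogloss(probs, labels, eps=1e-15):
    """Multiclass log loss: mean negative log-probability of the true class."""
    total = 0.0
    for row, label in zip(probs, labels):
        total -= math.log(max(row[label], eps))
    return total / len(labels)

# Toy data: 4 rows, 3 classes (the question has 45001 rows and 1161 classes).
labels = [0, 2, 1, 2]
probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.5, 0.3, 0.2],  # misclassified: the true class is 2
]

n_rows, n_classes = len(probs), len(probs[0])
flat_size = n_rows * n_classes  # one score per (row, class) pair

sizes_match = check_sizes(flat_size, len(labels))
err = merror(probs, labels)
loss = mlogloss(probs, labels)

print("prediction size:", flat_size, "label size:", len(labels))
print("sizes match:", sizes_match)
print("merror:", err)
print("mlogloss:", loss)
```

A multiclass model produces one score per row and class, so the flattened prediction is n_rows * n_classes long, while there is only one label per row. The multiclass metrics consume the per-class scores row by row, which is why they do not hit the size check.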
P.S.: currently, xgboost would be rather slow with so many classes, as there is some inefficiency with predictions caching during training.
answered Aug 3 '17 at 2:59
-
Please check the new error above after changing the evaluation metric – Kathiravan Natarajan Aug 4 '17 at 3:44