XGBoost feature selection: ranking features by importance
阿新 • Published: 2018-04-17
So, to stress it once more: never use a module name as a file name!
import operator

import pandas as pd
import xgboost as xgb
from matplotlib import pylab as plt


def create_feature_map(features):
    # Write an XGBoost feature map: one "<index>\t<name>\tq" line per feature
    # ("q" marks a quantitative feature).
    with open('xgb.fmap', 'w') as outfile:
        for i, feat in enumerate(features):
            outfile.write('{0}\t{1}\tq\n'.format(i, feat))


def get_data():
    train = pd.read_csv("../input/train.csv")
    features = list(train.columns[2:])
    y_train = train.Hazard
    # Encode each categorical column as the mean Hazard of its category.
    for feat in train.select_dtypes(include=['object']).columns:
        m = train.groupby(feat)['Hazard'].mean()
        train[feat] = train[feat].map(m)
    x_train = train[features]
    return features, x_train, y_train


def get_data2():
    from sklearn.datasets import load_iris
    # Load the data
    iris = load_iris()
    x_train = pd.DataFrame(iris.data)
    features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
    x_train.columns = features
    y_train = pd.DataFrame(iris.target)
    return features, x_train, y_train


# features, x_train, y_train = get_data()
features, x_train, y_train = get_data2()
create_feature_map(features)

# Newer XGBoost releases call this objective "reg:squarederror" and use
# "verbosity" in place of "silent".
xgb_params = {"objective": "reg:linear", "eta": 0.01, "max_depth": 8, "seed": 42, "silent": 1}
num_rounds = 1000

dtrain = xgb.DMatrix(x_train, label=y_train)
gbdt = xgb.train(xgb_params, dtrain, num_rounds)

# get_fscore returns, per feature, the number of splits that use it.
importance = gbdt.get_fscore(fmap='xgb.fmap')
importance = sorted(importance.items(), key=operator.itemgetter(1))

df = pd.DataFrame(importance, columns=['feature', 'fscore'])
# Normalize the split counts so they sum to 1.
df['fscore'] = df['fscore'] / df['fscore'].sum()

df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(16, 10))
plt.title('XGBoost Feature Importance')
plt.xlabel('relative importance')
plt.gcf().savefig('feature_importance_xgb.png')
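When growing each tree, XGBoost uses the gain in the structure score to decide which feature to split on and at which split point. The importance of a feature, as returned by get_fscore, is the total number of times it is used for a split across all trees.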
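To make the counting rule concrete, here is a minimal sketch that tallies splits over a hand-written toy ensemble and normalizes the counts the same way the script does. The nested-dict tree layout, the values in `toy_trees`, and the helper names `count_splits` and `split_count_importance` are assumptions made up for illustration; they are not XGBoost's internal format or API.

```python
from collections import Counter

# Toy stand-in for a boosted ensemble (hypothetical layout, not XGBoost's):
# an internal node is a dict naming the feature it splits on, a leaf is a float.
toy_trees = [
    {"feature": "petal_length",
     "left": 0.1,
     "right": {"feature": "petal_width", "left": 0.9, "right": 1.8}},
    {"feature": "petal_length",
     "left": {"feature": "sepal_length", "left": -0.2, "right": 0.1},
     "right": 0.4},
]


def count_splits(node, counts):
    """Add one count for the feature of every split node under `node`."""
    if not isinstance(node, dict):  # reached a leaf
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)


def split_count_importance(trees):
    """Return each feature's share of all splits across the ensemble."""
    counts = Counter()
    for tree in trees:
        count_splits(tree, counts)
    total = sum(counts.values())
    return {feat: n / total for feat, n in counts.items()}


importance = split_count_importance(toy_trees)
for feat, score in sorted(importance.items(), key=lambda kv: kv[1]):
    print(f"{feat}: {score:.2f}")
```

A feature that is never chosen for a split gets no entry at all, which is why a feature can be missing from the importance plot entirely.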
Reference: https://blog.csdn.net/q383700092/article/details/53698760
Also: I ran into a problem while using xgboost.
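One workaround I found online: create a new Python file and paste your code into it, or simply rename the file. If that still fails, copy the code somewhere outside the original folder; it gets recompiled and then runs normally. I don't think this addresses the real problem, but it works as a stopgap. Discussion is welcome!

Root cause: this is a mistake that beginners, or anyone not yet familiar with Python, tend to make, and avoiding it takes just one rule: never use a module name as a file name, whatever the file type. In my case the folder contained a file named xgboost.*, and because import xgboost searches the current directory first, that file was picked up in place of the installed library.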
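A quick way to check for this situation is to scan the working folder for anything named after the module and to ask Python where the import would resolve. The sketch below is one possible approach, not part of the original post: the helper name `find_shadowing_files` is hypothetical, and it follows the post's conservative advice by flagging any entry whose name before the first dot matches the module, regardless of file type.

```python
import importlib.util
from pathlib import Path


def find_shadowing_files(module_name, directory="."):
    """List entries in `directory` named like `module_name` (any extension)."""
    hits = []
    for path in Path(directory).iterdir():
        # "xgboost.py", "xgboost.pyc" and a folder "xgboost" all match "xgboost".
        if path.name.split(".")[0] == module_name:
            hits.append(path)
    return sorted(hits)


# Report suspicious files in the current folder.
for path in find_shadowing_files("xgboost"):
    print(f"rename or move: {path}")

# Show which file or package `import xgboost` would actually load.
spec = importlib.util.find_spec("xgboost")
print("import xgboost resolves to:", spec.origin if spec else "not found")
```

If the reported location is inside the project folder instead of the installed package directory, renaming the local file (and deleting any stale compiled copy next to it) addresses the cause, not just the symptom.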