hyperopt: a powerful hyperparameter tuning tool for Python
1. Installation
pip install hyperopt
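2. Overview
Hyperopt provides an optimization interface that takes an objective (evaluation) function and a parameter space, and evaluates the loss at points within that space. The user must also specify how each parameter is distributed within the space.
Hyperopt has four key ingredients: the function to minimize, the search space, the trials database that records the evaluated points (optional), and the search algorithm (optional).
First, define an objective function that takes a single argument and returns a loss value. For example, to minimize q(x, y) = x**2 + y**2, the objective receives x and y bundled into one argument.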
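As a minimal sketch of such an objective (plain Python, no hyperopt calls), assuming the single argument is an (x, y) pair:

```python
def q(args):
    """Objective for q(x, y) = x**2 + y**2; `args` is one sampled point."""
    x, y = args  # the single argument bundles both variables
    return x**2 + y**2

print(q((3, 4)))  # 25
```

Returning a plain number is enough here: the optimizer only needs a loss it can compare across points.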
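Next, choose a search algorithm, that is, the value passed as the algo parameter of hyperopt's fmin function. The currently supported algorithms are random search (hyperopt.rand.suggest), simulated annealing (hyperopt.anneal.suggest), and TPE (hyperopt.tpe.suggest).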
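To make the simplest of these strategies concrete, here is a stdlib-only sketch of the idea behind random search. It is an illustration, not hyperopt's implementation; random_search, sample_point, and the [-1, 1] search range are hypothetical choices made for this example.

```python
import random

def q(args):
    x, y = args
    return x**2 + y**2

def random_search(objective, sample_point, max_evals, seed=0):
    """Evaluate `max_evals` random points and return (best_point, best_loss)."""
    rng = random.Random(seed)
    best_point, best_loss = None, float("inf")
    for _ in range(max_evals):
        point = sample_point(rng)   # draw one candidate from the space
        loss = objective(point)     # evaluate it
        if loss < best_loss:        # keep the lowest loss seen so far
            best_point, best_loss = point, loss
    return best_point, best_loss

def sample_point(rng):
    # Hypothetical search space: x and y uniform on [-1, 1]
    return (rng.uniform(-1, 1), rng.uniform(-1, 1))

best_point, best_loss = random_search(q, sample_point, max_evals=100)
print(best_point, best_loss)
```

Simulated annealing and TPE differ in how the next candidate is proposed: they use the losses already observed rather than sampling every point independently.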
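To set up the parameter space, for example to optimize q, call fmin(q, space=hp.uniform('a', 0, 1)). The first argument of hp.uniform is a label; every hyperparameter in the space must have a unique label. hp.uniform itself specifies the parameter's distribution. Other available distributions include:
hp.choice(label, options) returns one of the options, which should be a list or tuple. The options can themselves be nested expressions, which is how conditional parameters are built.
hp.pchoice(label, p_options) returns one of the options in p_options with the given probability, where p_options is a list of (probability, option) pairs. Use it when the options should not be equally likely during the search.
hp.uniform(label, low, high): the value is uniformly distributed between low and high.
hp.quniform(label, low, high, q): the value is round(uniform(low, high) / q) * q, which suits parameters that take discrete values.
hp.loguniform(label, low, high) draws exp(uniform(low, high)), so the value lies in [exp(low), exp(high)].
hp.randint(label, upper) returns a random integer in the half-open interval [0, upper).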
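The formulas quoted above can be reproduced with the standard library. This is a sketch of the sampling rules only, not of hyperopt's hp functions (which build search-space expressions and take a label); the function names below are hypothetical stand-ins.

```python
import math
import random

rng = random.Random(0)

def uniform(low, high):
    return rng.uniform(low, high)

def quniform(low, high, q):
    # round(uniform(low, high) / q) * q: snaps the draw to a multiple of q
    return round(rng.uniform(low, high) / q) * q

def loguniform(low, high):
    # exp(uniform(low, high)): result lies in [exp(low), exp(high)]
    return math.exp(rng.uniform(low, high))

def randint(upper):
    # integer in the half-open interval [0, upper)
    return rng.randrange(upper)

print(quniform(0, 10, 2), loguniform(0, 1), randint(5))
```

Note that the bounds of loguniform are given in log space: loguniform(0, 1) yields values between 1 and e, not between 0 and 1.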
A search space can also be built from lists, tuples, and dictionaries.
from hyperopt import hp
list_space = [
    hp.uniform('a', 0, 1),
    hp.loguniform('b', 0, 1)]
tuple_space = (
    hp.uniform('a', 0, 1),
    hp.loguniform('b', 0, 1))
dict_space = {
    'a': hp.uniform('a', 0, 1),
    'b': hp.loguniform('b', 0, 1)}
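The shape of the space determines the shape of the argument the objective receives: a sequence for a list or tuple space, a dictionary for a dict space. A small sketch with hand-written sample points standing in for values hyperopt would draw (the function names are hypothetical):

```python
def q_from_sequence(args):
    a, b = args                          # list_space / tuple_space: unpack by position
    return a**2 + b**2

def q_from_dict(args):
    return args['a']**2 + args['b']**2   # dict_space: look up by key

# Hand-written points, for illustration only
print(q_from_sequence((0.5, 2.0)))        # 4.25
print(q_from_dict({'a': 0.5, 'b': 2.0}))  # 4.25
```

A dictionary space tends to be easier to maintain once there are more than a few hyperparameters, because the objective does not depend on their order.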
3. A simple example
from hyperopt import hp, fmin, rand, tpe, space_eval

def q(args):
    x, y = args
    return x**2 - 2*x + 1 + y**2

space = [hp.randint('x', 5), hp.randint('y', 5)]
best = fmin(q, space, algo=rand.suggest, max_evals=10)
print(best)
Output:
{'x': 2, 'y': 0}
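This result comes from a single run of random search with only 10 evaluations, so it need not be the true minimum. Because hp.randint('x', 5) and hp.randint('y', 5) only yield the integers 0 to 4, the 25 candidate points can be checked exhaustively with a short stdlib sketch:

```python
from itertools import product

def q(args):
    x, y = args
    return x**2 - 2*x + 1 + y**2   # equals (x - 1)**2 + y**2

grid = list(product(range(5), range(5)))   # all 25 (x, y) combinations
best_point = min(grid, key=q)
print(best_point, q(best_point))   # (1, 0) 0
print(q((2, 0)))                   # 1
```

The true minimum is at x = 1, y = 0 with loss 0; the point reported above has loss 1. Raising max_evals makes it more likely that the search reaches the optimum.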
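4. An xgboost example
xgboost has many parameters. Wrap the xgboost code in a function and pass it to fmin for tuning, using the cross-validated AUC as the optimization target. A higher AUC is better, but fmin minimizes, so we minimize -AUC instead. The dataset has 202 columns: the first column is the sample id, the last column is the label, and the 200 columns in between are features.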
# coding: utf-8
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
import xgboost as xgb
from random import shuffle
from functools import partial
# sklearn.cross_validation has been removed from newer scikit-learn releases
from sklearn.model_selection import cross_val_score
from hyperopt import fmin, tpe, hp

def loadFile(fileName="E://zalei//browsetop200Pca.csv"):
    data = pd.read_csv(fileName, header=None)
    return data.values

data = loadFile()
label = data[:, -1]
# Everything except the label. Note this keeps the first column too;
# use data[:, 1:-1] if that column is the sample id.
attrs = data[:, :-1]
labels = label.reshape((1, -1))
label = labels.tolist()[0]

minmaxscaler = MinMaxScaler()
attrs = minmaxscaler.fit_transform(attrs)

# Shuffle the row indices and split them 70% / 30% into train and test sets
index = list(range(len(label)))
shuffle(index)
trainIndex = index[:int(len(label) * 0.7)]
print(len(trainIndex))
testIndex = index[int(len(label) * 0.7):]
print(len(testIndex))
attr_train = attrs[trainIndex, :]
print(attr_train.shape)
attr_test = attrs[testIndex, :]
print(attr_test.shape)
label_train = labels[:, trainIndex].tolist()[0]
print(len(label_train))
label_test = labels[:, testIndex].tolist()[0]
print(len(label_test))
print(np.array(label_train).reshape((-1, 1)).shape)

def GBM(argsDict):
    # hp.randint yields integers starting at 0; map them onto the real ranges
    max_depth = int(argsDict["max_depth"]) + 5
    n_estimators = int(argsDict["n_estimators"]) * 5 + 50
    learning_rate = argsDict["learning_rate"] * 0.02 + 0.05
    subsample = argsDict["subsample"] * 0.1 + 0.7
    min_child_weight = int(argsDict["min_child_weight"]) + 1
    print("max_depth:" + str(max_depth))
    print("n_estimator:" + str(n_estimators))
    print("learning_rate:" + str(learning_rate))
    print("subsample:" + str(subsample))
    print("min_child_weight:" + str(min_child_weight))

    gbm = xgb.XGBClassifier(
        n_jobs=4,                           # number of threads (nthread in older xgboost versions)
        max_depth=max_depth,                # maximum tree depth
        n_estimators=n_estimators,          # number of trees
        learning_rate=learning_rate,        # learning rate
        subsample=subsample,                # fraction of rows sampled per tree
        min_child_weight=min_child_weight,  # minimum sum of instance weights in a child
        max_delta_step=10,                  # cap on each leaf's weight update (not early stopping)
        objective="binary:logistic")

    metric = cross_val_score(gbm, attr_train, label_train, cv=5, scoring="roc_auc").mean()
    print(metric)
    return -metric  # fmin minimizes, so return the negative AUC

space = {
    "max_depth": hp.randint("max_depth", 15),               # [0..14] -> 5..19
    "n_estimators": hp.randint("n_estimators", 10),         # [0..9]  -> 50, 55, ..., 95
    "learning_rate": hp.randint("learning_rate", 6),        # [0..5]  -> 0.05, 0.07, ..., 0.15
    "subsample": hp.randint("subsample", 4),                # [0..3]  -> 0.7, 0.8, 0.9, 1.0
    "min_child_weight": hp.randint("min_child_weight", 5),  # [0..4]  -> 1..5
}
algo = partial(tpe.suggest, n_startup_jobs=1)
best = fmin(GBM, space, algo=algo, max_evals=4)
print(best)
print(GBM(best))
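The dictionary printed from fmin holds the raw hp.randint draws (integers starting at 0), not the values the classifier was trained with. A small helper can apply the same transformations used inside GBM; decode and example_best are hypothetical names, and the sample result below is made up for illustration.

```python
def decode(best):
    """Apply the same transformations used inside GBM to fmin's raw result."""
    return {
        "max_depth": best["max_depth"] + 5,
        "n_estimators": best["n_estimators"] * 5 + 50,
        "learning_rate": best["learning_rate"] * 0.02 + 0.05,
        "subsample": best["subsample"] * 0.1 + 0.7,
        "min_child_weight": best["min_child_weight"] + 1,
    }

# Hypothetical fmin result, for illustration only
example_best = {"max_depth": 3, "n_estimators": 9, "learning_rate": 2,
                "subsample": 1, "min_child_weight": 0}
print(decode(example_best))
```

Keeping this mapping in a single function, and calling it from the objective as well, would avoid the transformation in GBM drifting out of step with the one used for reporting.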
For more details, see: http://blog.csdn.net/qq_34139222/article/details/60322995