Dream team: Stacking for combining classifiers
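In stacking, every trained base model predicts on the whole training set. The prediction of the j-th base model for the i-th training sample becomes the j-th feature of the i-th sample in a new training set, and the final model is then trained on this new training set. Prediction works the same way: the test samples first pass through all base models to form a new test set, and the final model predicts on that new test set.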
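The procedure described above can be sketched by hand with scikit-learn, without a stacking library. This is only a minimal illustration: the iris data, the three base models, the train/test split and all variable names are assumptions chosen for the example. It follows the description literally, so the base models predict on the same data they were trained on.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Illustrative base models; any classifiers could be used here.
base_models = [KNeighborsClassifier(n_neighbors=5),
               DecisionTreeClassifier(random_state=0),
               GaussianNB()]
for model in base_models:
    model.fit(X_train, y_train)


def make_meta_features(models, X):
    # Column j holds the predictions of base model j for every sample.
    return np.column_stack([m.predict(X) for m in models])


meta_train = make_meta_features(base_models, X_train)  # new training set
meta_test = make_meta_features(base_models, X_test)    # new test set

# The final model is trained on the base models' predictions only.
meta_model = LogisticRegression(max_iter=1000)
meta_model.fit(meta_train, y_train)
final_pred = meta_model.predict(meta_test)
print("stacked accuracy:", (final_pred == y_test).mean())
```

Each row of meta_train has one column per base model, which is exactly the "j-th feature of the i-th sample" layout described above.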
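Below we introduce mlxtend, a powerful library for stacking that makes it quick to stack sklearn models.
Its main usage patterns are the following:
I. The most basic usage: the predictions produced by the first-level classifiers are used as the input data for the final meta-classifier.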
from sklearn import datasets
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

# First-level (base) classifiers
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
# Meta-classifier trained on the base classifiers' predicted labels
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                          meta_classifier=lr)

print('3-fold cross validation:\n')
for clf, label in zip([clf1, clf2, clf3, sclf],
                      ['KNN', 'Random Forest', 'Naive Bayes', 'StackingClassifier']):
    scores = model_selection.cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
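II. Alternatively, the class probabilities produced by the first-level classifiers can be used as the input to the meta-classifier. To do this, set the StackingClassifier parameter use_probas=True. If average_probas=True, the probabilities that the base classifiers produce for each class are averaged; otherwise they are concatenated.
For example, suppose two base classifiers produce these probability outputs:
classifier 1: [0.2, 0.5, 0.3]
classifier 2: [0.3, 0.4, 0.4]
1) average_probas=True:
the resulting meta-feature is [0.25, 0.45, 0.35]
2) average_probas=False:
the resulting meta-feature is [0.2, 0.5, 0.3, 0.3, 0.4, 0.4]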
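The two ways of combining the probabilities can be reproduced with plain NumPy. This is only a sketch of the arithmetic in the example, not of the library internals; the variable names are made up for illustration.

```python
import numpy as np

# Class-probability outputs of the two base classifiers,
# copied as given in the example.
p1 = np.array([0.2, 0.5, 0.3])
p2 = np.array([0.3, 0.4, 0.4])

# Averaging: one value per class.
averaged = np.mean([p1, p2], axis=0)
# Concatenating: one value per (classifier, class) pair.
concatenated = np.concatenate([p1, p2])

print(averaged)
print(concatenated)
```

Averaging keeps the meta-feature as short as the number of classes, while concatenation lets the meta-classifier weight each base classifier separately.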
from sklearn import datasets
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier

iris = datasets.load_iris()
X, y = iris.data[:, 1:3], iris.target

clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
# use_probas=True: the meta-classifier receives class probabilities;
# average_probas=False: the probabilities are concatenated, not averaged.
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                          use_probas=True,
                          average_probas=False,
                          meta_classifier=lr)

print('3-fold cross validation:\n')
for clf, label in zip([clf1, clf2, clf3, sclf],
                      ['KNN', 'Random Forest', 'Naive Bayes', 'StackingClassifier']):
    scores = model_selection.cross_val_score(clf, X, y, cv=3, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
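III. A third approach works on the feature dimensions of the training set. Instead of giving every base classifier all the features, each base classifier gets a different subset: for example, base classifier 1 is trained on the first half of the features and base classifier 2 on the second half (this can be done with sklearn pipelines). The classifiers are then combined with StackingClassifier.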
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from mlxtend.classifier import StackingClassifier
from mlxtend.feature_selection import ColumnSelector

iris = load_iris()
X = iris.data
y = iris.target

# Each pipeline first selects its own columns, then fits a classifier on them.
pipe1 = make_pipeline(ColumnSelector(cols=(0, 2)), LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)), LogisticRegression())

sclf = StackingClassifier(classifiers=[pipe1, pipe2],
                          meta_classifier=LogisticRegression())
sclf.fit(X, y)
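StackingClassifier API and parameters:
StackingClassifier(classifiers, meta_classifier, use_probas=False, average_probas=False, verbose=0, use_features_in_secondary=False)
Parameters:
classifiers : the base classifiers, array-like, e.g. [cl1, cl2, cl3]. The fitted base classifiers are stored in the attribute self.clfs_.
meta_classifier : the target classifier, i.e. the classifier that combines the outputs of the base classifiers.
use_probas : bool (default: False). If True, the meta-classifier is trained on the class probabilities output by the base classifiers instead of on their class labels.
average_probas : bool (default: False). When probabilities are used (use_probas=True), controls whether they are averaged.
verbose : int, optional (default=0). Controls the log output during fitting. verbose=0 prints nothing; verbose=1 prints the index and name of the classifier being fitted; verbose=2 prints detailed parameter information; verbose>2 passes verbose - 2 on as the verbose setting of the underlying classifiers.
use_features_in_secondary : bool (default: False). If True, the meta-classifier is trained on both the outputs of the base classifiers and the original dataset. If False, it is trained only on the outputs of the base classifiers.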
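To make the effect of use_features_in_secondary concrete, the sketch below builds the two possible inputs to the meta-classifier with NumPy. The data values are hypothetical, and the column order (original features first) is an assumption made for illustration rather than a statement about the library's internals.

```python
import numpy as np

# Hypothetical data: 4 samples with 2 original features.
X = np.array([[5.1, 3.5],
              [4.9, 3.0],
              [6.2, 2.9],
              [5.9, 3.0]])
# Hypothetical label predictions of 3 base classifiers (one column each).
meta_features = np.array([[0, 0, 0],
                          [0, 1, 0],
                          [1, 1, 2],
                          [1, 1, 1]])

# use_features_in_secondary=False: only the base classifiers' outputs.
meta_input_plain = meta_features
# use_features_in_secondary=True: original features kept alongside them
# (column order assumed here for illustration).
meta_input_with_features = np.hstack([X, meta_features])

print(meta_input_plain.shape)
print(meta_input_with_features.shape)
```

With the option enabled, the meta-classifier sees 2 + 3 = 5 columns per sample in this toy case instead of 3.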
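Attributes:
clfs_ : list, shape [n_classifiers]. The fitted base classifiers.
meta_clf_ : the fitted meta-classifier.
Methods:
fit(X, y)
fit_transform(X, y=None, **fit_params)
get_params(deep=True): returns the classifier's parameters, for example for use with sklearn's GridSearch.
predict(X)
predict_proba(X)
score(X, y, sample_weight=None): returns the accuracy on the given data and labels.
set_params(**params): sets the classifier's parameters; params are specified in the same format as in sklearn.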
Can a set of weak systems turn into a single strong system?
When you’re in front of a complex classification problem, as is often the case with financial markets, different approaches may appear while searching for a solution. Although these approaches can estimate the classification, sometimes none of them are better than the rest. In this case, a reasonable choice is to keep them all, and then create a final system by integrating the pieces. This method of diversification is one of the most convenient practices: divide the decision among several systems in order to avoid putting all your eggs in one basket.
Once I have a number of estimates for the one case, how can I combine the decisions of the N sub-systems? As a quick answer, I can take the decision average and use this. But are there different ways of making the most out of my sub-systems? Of course there are!
Think outside the box!
Several classifiers with a common objective are called multiclassifiers. In Machine Learning, multiclassifiers are sets of different classifiers which make estimates and are fused together, obtaining a result that is a combination of them. Lots of terms are used to refer to multiclassifiers: multi-models, multiple classifier systems, combining classifiers, decision committee, etc. They can be divided into two main groups:
- Ensemble methods: Sets of systems that are combined into a new system using the same learning technique. Bagging and Boosting are the most widely used.
- Hybrid methods: Takes a set of different learners and combines them using new learning techniques. Stacking (or Stacked Generalization) is one of the main hybrid multiclassifiers.
How to build a multiclassifier motivated by Stacking.
Imagine that I would like to estimate the EURUSD’s trends. First of all, I turn my issue into a classification problem, so I split the price data into two types (or classes): up and down movements. Guessing every daily movement is not my intention. I only want to detect the main trends: up for trading Long (class = 1) and down for trading Short (class = 0).
I have done this split a posteriori, by which I mean that all historical data have been used to decide the classes, so the split takes some future information into account. Therefore, I cannot be sure whether the current moment belongs to an up or a down movement. For this reason an estimate of today’s class is required.
For the purpose of this example I have designed three independent systems. They are three different learners using separate sets of attributes. It does not matter if you use the same learner algorithm or if they share some/all attributes; the key is that they must be different enough in order to guarantee diversification.
Then, they trade based on those probabilities. If a system’s estimate E is above 50%, it signals a Long entry, the more strongly the larger E is. If E is under 50%, it signals a Short entry, the more strongly the smaller E is.
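As a rough illustration of this rule, the function below maps an estimate E to a signed position. The linear scaling and the function name are assumptions made for this sketch; the text only says that the entry is stronger the further E is from 50%.

```python
def position_from_estimate(e):
    """Map an estimated probability of the up class to a signed position.

    Returns a value in [-1, 1]: positive means Long, negative means Short,
    and the magnitude grows as e moves away from 0.5. The linear scaling
    is an assumption made for this sketch.
    """
    if not 0.0 <= e <= 1.0:
        raise ValueError("e must be a probability between 0 and 1")
    return 2.0 * e - 1.0


for e in (0.9, 0.6, 0.5, 0.3):
    pos = position_from_estimate(e)
    side = "Long" if pos > 0 else "Short" if pos < 0 else "Flat"
    print(f"E={e:.2f} -> {side} with size {abs(pos):.2f}")
```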
These are the results of my three systems:
Can a set of poor players make up a dream team?
The purpose of building a multiclassifier is to obtain better predictive performance than what could be obtained from any single classifier. Let’s see if this is the case.
The method I am going to use in this example is based on the Stacking algorithm. The idea of Stacking is that the output of primary classifiers, called level 0 models, will be used as attributes for another classifier, called meta-model, to approximate the same classification problem. The meta-model is left to figure out the combining mechanism. It will be in charge of connecting the level 0 models’ replies and the real classification.
The rigorous process consists of splitting the training set into disjoint sets. Each level 0 learner is then trained on all the data except one set and applied to the excluded set. Repeating this for each set yields, for each learner, an estimate for every data point. These estimates are the attributes used to train the meta-model, or level 1 model. As my data is a time series, I decided to build the estimate for day d using only the data from day 1 to day d-1.
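The time-series variant (estimate day d from days 1 to d-1 only) might look like the sketch below. The synthetic data, the choice of LogisticRegression as the level 0 learner, the min_train threshold and the function name are all assumptions for illustration; the original systems are not described in enough detail to reproduce.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def expanding_window_estimates(make_model, X, y, min_train=30):
    """For each day d >= min_train, fit on days [0, d) and estimate day d.

    Returns an array of P(class = 1); days without enough history are NaN.
    """
    estimates = np.full(len(y), np.nan)
    for d in range(min_train, len(y)):
        if len(np.unique(y[:d])) < 2:
            continue  # cannot fit a classifier on a single class
        model = make_model()
        model.fit(X[:d], y[:d])          # only past data is used
        estimates[d] = model.predict_proba(X[d:d + 1])[0, 1]
    return estimates


# Synthetic stand-in for one level 0 system's attributes and up/down classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

level0_estimates = expanding_window_estimates(
    lambda: LogisticRegression(), X, y, min_train=30)
print(level0_estimates[30:35])
```

Running this once per level 0 system and stacking the resulting columns side by side gives the attribute table on which the meta-model is trained.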
Which model does this work with?
The meta-model can be a classification tree, a random forest, a support vector machine… Any classification learner is valid. For this example I chose a nearest neighbours algorithm. This means that the meta-model estimates the class of new data by finding similar configurations of the level 0 classifications in past data, and then assigns the class of those similar situations.
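A nearest neighbours meta-model of this kind can be sketched in a few lines of NumPy. The past estimates, the classes, the value of k and the Euclidean distance are hypothetical choices for illustration, not the configuration used for the results in this post.

```python
import numpy as np


def knn_meta_predict(past_estimates, past_classes, new_estimates, k=3):
    """Classify one new row of level 0 estimates by majority vote of the
    k most similar past rows (Euclidean distance). Assumes k is odd."""
    distances = np.linalg.norm(past_estimates - new_estimates, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = past_classes[nearest]
    return int(votes.sum() * 2 > k)  # 1 if most neighbours were class 1


# Hypothetical past outputs of three level 0 systems (P(up) per system)
# and the true class observed afterwards.
past_estimates = np.array([[0.80, 0.70, 0.90],
                           [0.75, 0.65, 0.60],
                           [0.70, 0.80, 0.85],
                           [0.20, 0.30, 0.10],
                           [0.35, 0.25, 0.30],
                           [0.30, 0.20, 0.40]])
past_classes = np.array([1, 1, 1, 0, 0, 0])

pred_up = knn_meta_predict(past_estimates, past_classes,
                           np.array([0.72, 0.68, 0.80]))
pred_down = knn_meta_predict(past_estimates, past_classes,
                             np.array([0.25, 0.30, 0.20]))
print(pred_up, pred_down)
```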
Let’s see how good my dream team result is…
Conclusion
This is just one example of the huge number of available multiclassifiers. They can help you not only to join your partial solutions into a single answer by means of a modern and original technique, but also to create a real dream team. There is also significant room for improvement in the way that the individual pieces are integrated into a single system.
So, next time you need to combine, spend more than a moment working on the possibilities. Avoid the traditional average by force of habit and explore more complex methods. They may surprise you with extra performance.