Feature Selection --- SelectKBest
from sklearn.feature_selection import SelectKBest

Reference: http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest.set_params

class SelectKBest(_BaseFilter):
    """Select features according to the k highest scores.

    Read more in the :ref:`User Guide <univariate_feature_selection>`.

    Parameters
    ----------
    score_func : callable
        Function taking two arrays X and y, and returning a pair of arrays
        (scores, pvalues) or a single array with scores.
        Default is f_classif (see below "See also"). The default function only
        works with classification tasks.

    k : int or "all", optional, default=10
        Number of top features to select.
        The "all" option bypasses selection, for use in a parameter search.

    Attributes
    ----------
    scores_ : array-like, shape=(n_features,)
        Scores of features.

    pvalues_ : array-like, shape=(n_features,)
        p-values of feature scores, None if `score_func` returned only scores.

    Notes
    -----
    Ties between features with equal scores will be broken in an unspecified
    way.

    See also
    --------
    f_classif: ANOVA F-value between label/feature for classification tasks.
    mutual_info_classif: Mutual information for a discrete target.
    chi2: Chi-squared stats of non-negative features for classification tasks.
    f_regression: F-value between label/feature for regression tasks.
    mutual_info_regression: Mutual information for a continuous target.
    SelectPercentile: Select features based on percentile of the highest scores.
    SelectFpr: Select features based on a false positive rate test.
    SelectFdr: Select features based on an estimated false discovery rate.
    SelectFwe: Select features based on family-wise error rate.
    GenericUnivariateSelect: Univariate feature selector with configurable mode.
    """
An example from the official documentation (you must supply the score function and the value of k yourself).
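The official example boils down to the following: chi2 is passed in as the score function and k=2, reducing the iris data from four features to two.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Load the iris data: 150 samples, 4 features.
X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-squared scores.
X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
print(X_new.shape)  # (150, 2)
```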
Parameters
1. score_func : callable. A function that takes two arrays X and y and returns either a pair of arrays (scores, pvalues) or a single array of scores. The default is f_classif, which only works for classification tasks.
2. k : int or "all", optional, default=10. The number of top features to select. The "all" option bypasses selection, for use in a parameter search.
Attributes
1. scores_ : array-like, shape=(n_features,). The scores of the features.
2. pvalues_ : array-like, shape=(n_features,). The p-values of the feature scores; None if score_func returned only scores.
Score functions available for score_func (listed under "See also" in the docstring above): f_classif, chi2, and mutual_info_classif for classification; f_regression and mutual_info_regression for regression.
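All of these score functions live in sklearn.feature_selection. A quick sketch of what each returns (iris is used here just as a convenient classification dataset):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import (
    f_classif,              # ANOVA F-value, classification (the default)
    chi2,                   # chi-squared, needs non-negative features
    mutual_info_classif,    # mutual information, discrete target
    f_regression,           # F-value, regression
    mutual_info_regression, # mutual information, continuous target
)

X, y = load_iris(return_X_y=True)

# f_classif and chi2 return a (scores, pvalues) pair...
scores, pvalues = f_classif(X, y)
print(scores.shape, pvalues.shape)  # one score and one p-value per feature

# ...while the mutual information functions return scores only,
# so SelectKBest's pvalues_ attribute would be None for them.
mi = mutual_info_classif(X, y, random_state=0)
print(mi.shape)
```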
Methods
1. fit(X, y): run the score function on (X, y) and get the appropriate features.
2. fit_transform(X[, y]): fit to the data, then transform it.
3. get_params([deep]): get the parameters of this estimator.
4. get_support([indices]): get a mask, or integer indices, of the features selected.
5. inverse_transform(X): reverse the transformation operation.
6. set_params(**params): set the parameters of this estimator.
7. transform(X): reduce X to the selected features.
How do we get back the names, or the indices, of the selected features? It was already hinted at in the method list above: get_support().
The digit data used earlier carries no feature names, so I switched to the Boston housing data, which does. Because it is a regression dataset, the score function changes accordingly, to f_regression. Note that you must call fit() first before get_support() can be used. If the indices argument is set to True, the return value is the indices of the selected features. You cannot get the feature names directly from this call, but matching the indices against the dataset's feature names is easy. Here I set k=5 to keep the five highest-scoring features, which turn out to be attributes 2, 5, 9, 10, and 12.
If indices is set to False, the return value is a boolean mask indicating whether each feature was selected.
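A sketch of the same workflow. Note that load_boston was removed in scikit-learn 1.2, so this uses the bundled diabetes regression dataset as a stand-in; the selected indices will therefore differ from the 2, 5, 9, 10, 12 reported above for the Boston data.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

# Stand-in for the Boston housing data (removed in scikit-learn 1.2):
# a bundled regression dataset that also provides feature names.
data = load_diabetes()
X, y = data.data, data.target          # 442 samples, 10 features

selector = SelectKBest(f_regression, k=5)
selector.fit(X, y)                     # must fit before get_support() works

idx = selector.get_support(indices=True)   # integer indices of the kept features
mask = selector.get_support()              # boolean mask, one entry per feature
names = np.array(data.feature_names)[idx]  # map indices back to feature names

print(idx)
print(mask)
print(names)
```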
Source: https://www.jianshu.com/p/586ba8c96a3d