Learning sklearn's feature extraction methods
阿新 • Published: 2019-01-28
The sklearn.feature_extraction module handles feature extraction from raw data; it currently covers methods for extracting features from text and from images.
- sklearn.feature_extraction.DictVectorizer(dtype=<type 'numpy.float64'>, separator='=', sparse=True, sort=True): maps a list of dicts to a matrix
separator: separator used when constructing new dummy-variable (one-hot) feature names, default '='
sparse: whether the transformed output is sparse; True produces a sparse matrix, False produces an ndarray
sort: whether feature_names_ and vocabulary_ are sorted when fitting, default True
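As a quick sketch of how the separator and sparse parameters interact (a minimal example of my own, not from the original post, using a hypothetical '_' separator):

```python
from sklearn.feature_extraction import DictVectorizer

# separator only matters for string (categorical) values, where it joins
# key and value into the one-hot feature name; sparse controls whether
# (fit_)transform returns a sparse matrix or a dense ndarray
v = DictVectorizer(sparse=False, separator='_', sort=True)
X = v.fit_transform([{'city': 'Dubai'}, {'city': 'London'}])
print(v.feature_names_)  # ['city_Dubai', 'city_London']
print(X)                 # [[1. 0.]
                         #  [0. 1.]]
```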
① Examples of feature extraction with DictVectorizer

a. All dict values are numeric:

In [1]: from sklearn.feature_extraction import DictVectorizer
   ...: import numpy as np
   ...: v = DictVectorizer(dtype=np.float64, sparse=False, sort=True)
   ...: D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
   ...: X = v.fit_transform(D)
   ...: print(type(v.fit_transform(D)))
   ...: X
<class 'numpy.ndarray'>
Out[1]:
array([[ 2.,  0.,  1.],
       [ 0.,  1.,  3.]])

b. Dict values are strings (i.e. categorical):

In [3]: from sklearn.feature_extraction import DictVectorizer
   ...: measurements = [{'city': 'Dubai', 'temperature': 33.},
   ...:                 {'city': 'London', 'temperature': 12.},
   ...:                 {'city': 'San Fransisco', 'temperature': 18.}]
   ...: vec = DictVectorizer(sparse=True, separator=':', sort=True)
   ...: print(type(vec.fit_transform(measurements)))
   ...: vec.fit_transform(measurements).toarray()
<class 'scipy.sparse.csr.csr_matrix'>
Out[3]:
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])

From the two cases above: when the dict values are all numeric, each dict key is mapped to a feature name and its value to the feature value; when a dict value is a string (i.e. categorical), it is one-hot (dummy-variable) encoded: the feature name becomes "key + separator + value" and the feature value is 0 or 1.

② Attributes

vocabulary_: returns a dict mapping feature names to their column indices

In [4]: v.vocabulary_
Out[4]: {'bar': 0, 'baz': 1, 'foo': 2}

In [5]: vec.vocabulary_
Out[5]: {'city:Dubai': 0, 'city:London': 1, 'city:San Fransisco': 2, 'temperature': 3}

feature_names_: returns the list of feature names
In [6]: v.feature_names_
Out[6]: ['bar', 'baz', 'foo']
In [7]: vec.feature_names_
Out[7]: ['city:Dubai', 'city:London', 'city:San Fransisco', 'temperature']
③ Methods
- fit(X, y=None)
In [8]: v.fit(D)
Out[8]:
DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
sparse=False)
In [9]: vec.fit(measurements)
Out[9]:
DictVectorizer(dtype=<class 'numpy.float64'>, separator=':', sort=True,
sparse=True)
- fit_transform(X, y=None)
In [10]: v.fit_transform(D)
Out[10]:
array([[ 2., 0., 1.],
[ 0., 1., 3.]])
In [11]: vec.fit_transform(measurements).toarray()
Out[11]:
array([[ 1., 0., 0., 33.],
[ 0., 1., 0., 12.],
[ 0., 0., 1., 18.]])
In [12]: vec.fit_transform(measurements)
Out[12]:
<3x4 sparse matrix of type '<class 'numpy.float64'>'
with 6 stored elements in Compressed Sparse Row format>
- get_feature_names()
In [13]: v.get_feature_names()
Out[13]: ['bar', 'baz', 'foo']
In [14]: vec.get_feature_names()
Out[14]: ['city:Dubai', 'city:London', 'city:San Fransisco', 'temperature']
- inverse_transform(X, dict_type=<type 'dict'>)
In [15]: v.inverse_transform(X,dict)
Out[15]: [{'bar': 2.0, 'foo': 1.0}, {'baz': 1.0, 'foo': 3.0}]
In [16]: X2 = vec.fit_transform(measurements).toarray()
In [17]: X2
Out[17]:
array([[ 1., 0., 0., 33.],
[ 0., 1., 0., 12.],
[ 0., 0., 1., 18.]])
In [18]: vec.inverse_transform(X2,dict_type=dict)
Out[18]:
[{'city:Dubai': 1.0, 'temperature': 33.0},
{'city:London': 1.0, 'temperature': 12.0},
{'city:San Fransisco': 1.0, 'temperature': 18.0}]
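Two caveats worth noting here (my addition, easy to verify): inverse_transform drops zero entries, and it rebuilds dicts from the fitted column names, so one-hot columns come back as e.g. 'city:Dubai': 1.0 rather than the original {'city': 'Dubai'}. A minimal sketch:

```python
from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=False)
X = v.fit_transform([{'foo': 1, 'bar': 2}, {'foo': 3}])
# zero entries (here 'bar' in row 2) are omitted on the way back
print(v.inverse_transform(X))  # [{'bar': 2.0, 'foo': 1.0}, {'foo': 3.0}]
```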
- restrict(support, indices=False)
In [19]: from sklearn.feature_extraction import DictVectorizer
...: from sklearn.feature_selection import SelectKBest,chi2
...: import numpy as np
...: v = DictVectorizer(dtype=np.float64,sparse=False,sort=True)
...: D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
...: X = v.fit_transform(D)
...: support = SelectKBest(chi2,k=2).fit(X,[0,1])
...: print(support.get_support())
...: print(v.restrict(support.get_support()))
...: v.get_feature_names()
...:
[ True False True]
DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
sparse=False)
Out[19]: ['bar', 'foo']
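The example above passes a boolean mask; with indices=True, restrict instead takes the column indices to keep (a small sketch of my own, not from the post):

```python
from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=False)
v.fit([{'foo': 1, 'bar': 2, 'baz': 3}])  # columns: ['bar', 'baz', 'foo']
# indices=True: keep columns 0 and 2 by position instead of by mask
v.restrict([0, 2], indices=True)
print(v.feature_names_)  # ['bar', 'foo']
```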
- transform(X, y=None)
In [20]: v.transform(D)
Out[20]:
array([[ 2., 1.],
[ 0., 3.]])
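One behavior of transform that the example above does not show (my addition): features that were not seen during fit are silently ignored, so only the fitted columns appear in the output. A minimal sketch:

```python
from sklearn.feature_extraction import DictVectorizer

v = DictVectorizer(sparse=False)
v.fit([{'foo': 1, 'bar': 2}])  # learns columns ['bar', 'foo']
# 'new' was not present at fit time, so transform drops it
print(v.transform([{'foo': 5, 'new': 9}]))  # [[0. 5.]]
```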
- get_params(deep=True)
In [21]: v.get_params()
Out[21]: {'dtype': numpy.float64, 'separator': '=', 'sort': True, 'sparse': False}
- set_params(**params)
In [22]: v.set_params(sparse=True)
Out[22]:
DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
sparse=True)