K-Fold Cross-Validation in Python

1. Cross Validation

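Cross validation works roughly like this: the original data is split into two parts, train_data for training the model and test_data for measuring its accuracy. The error measured on test_data is called the validation_error. To judge an algorithm on a dataset, a single random split into train_data and test_data is not enough, because the resulting validation_error depends on which samples happened to land in each part. Instead, we split the data at random many times and compute a validation_error for each split. This set of validation_error values gives a much more reliable measure of how good the algorithm is.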
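The idea of repeating the random split can be sketched as follows. This is a minimal illustration rather than code from the original post: it assumes a recent scikit-learn in which train_test_split lives in sklearn.model_selection, uses the iris dataset and a linear SVC as stand-ins, and the number of repetitions (10) and the test fraction (0.3) are arbitrary choices.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

validation_errors = []
for seed in range(10):  # 10 repetitions is an arbitrary choice
    # A different random_state gives a different random split each time.
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=seed)
    clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
    # Error rate on the held-out part = 1 - accuracy.
    validation_errors.append(1.0 - clf.score(X_test, y_test))

validation_errors = np.array(validation_errors)
print("mean validation_error: %.3f (std %.3f)"
      % (validation_errors.mean(), validation_errors.std()))
```

The spread of the individual values shows how much a single split could mislead.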

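Cross validation is a very good way to evaluate performance when the amount of data is limited. There are many ways to split the original data into train data and test data, which is why there are many variants of cross validation.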
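One of those splitting schemes is K-fold, the subject of this post: the samples are divided into K parts, and each part serves as test data exactly once while the remaining K-1 parts are used for training. A small sketch, assuming a recent scikit-learn that provides sklearn.model_selection.KFold; the 10-sample toy array is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape((10, 2))  # 10 toy samples, 2 features each

kf = KFold(n_splits=5)  # K = 5, no shuffling
test_counts = np.zeros(len(X), dtype=int)
fold_sizes = []
for train_idx, test_idx in kf.split(X):
    # train_idx and test_idx are disjoint arrays of row indices.
    fold_sizes.append((len(train_idx), len(test_idx)))
    test_counts[test_idx] += 1

print(fold_sizes)   # (train size, test size) for each of the 5 folds
print(test_counts)  # how often each sample was used for testing
```

Every sample is tested exactly once, so no data is wasted and no sample is counted twice.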

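The most important function in sklearn's cross validation module is the following:
sklearn.model_selection.cross_val_score (named sklearn.cross_validation.cross_val_score in releases before 0.18; the old module was removed in 0.20). It is called as scores = model_selection.cross_val_score(clf, raw_data, raw_target, cv=5, scoring=None)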

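Parameters:

clf: the classifier to evaluate; any classifier works, for example a support vector machine: clf = svm.SVC(kernel='linear', C=1);
raw_data: the original data;
raw_target: the original class labels;
cv: the cross validation strategy. Quoting scikit-learn: "When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default, the latter being used if the estimator derives from ClassifierMixin." In other words, when cv is an int, StratifiedKFold is used if the estimator is a classifier, and KFold is used otherwise.
cross_val_score: the return value is an array with one score per split of raw_data, each measured on that split's test_data. The metric can be chosen with the scoring parameter (called score_func in very old releases); if it is not given, the estimator's own score method is used, which is accuracy for classifiers.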
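To make the cv and return-value descriptions concrete, the loop below reproduces by hand what cross_val_score does for cv=5 with a classifier. It is a sketch under stated assumptions: a recent scikit-learn with sklearn.model_selection, the iris dataset as raw_data and raw_target, and the linear SVC mentioned above. The names skf, manual_scores and library_scores are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = svm.SVC(kernel='linear', C=1)

# What an integer cv=5 stands for when the estimator is a classifier.
skf = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, test_idx in skf.split(X, y):
    model = clone(clf)  # fresh, unfitted copy for every fold
    model.fit(X[train_idx], y[train_idx])
    # Without a scoring argument, the estimator's own score() is used
    # (accuracy for SVC).
    manual_scores.append(model.score(X[test_idx], y[test_idx]))
manual_scores = np.array(manual_scores)

library_scores = cross_val_score(clf, X, y, cv=5)
print(manual_scores)
print(library_scores)
```

Both arrays hold one accuracy per fold and should agree, since the library call performs the same fit-and-score steps internally.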

Cross-validation code with scikit-learn:

<code class="hljs python" style="display: block; white-space: pre;">>>> from sklearn import model_selection
>>> from sklearn import svm
>>> from sklearn import datasets
>>> iris = datasets.load_iris()  # the iris dataset used throughout this post
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = model_selection.cross_val_score(clf, iris.data, iris.target, cv=5)  # 5-fold cv
>>> # change the metric with the scoring parameter
>>> scores = model_selection.cross_val_score(clf, iris.data, iris.target, cv=5, scoring='f1_weighted')
>>> # f1 score: http://en.wikipedia.org/wiki/F1_score</code>

Note: to use logistic regression (LR) instead, set clf = LogisticRegression(), imported from sklearn.linear_model.
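Swapping in logistic regression might look like the sketch below. It assumes a recent scikit-learn; the max_iter value is an illustrative choice that only gives the solver more room to converge and is not part of the original note.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
# max_iter=1000 is an assumption made for this example, not a recommendation.
clf = LogisticRegression(max_iter=1000)
lr_scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print(lr_scores.mean())
```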

Generate a small dataset for cross-validation:

<code class="hljs python" style="display: block; white-space: pre;">>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]</code>

Split the data into a training set and a test set:

<code class="hljs python" style="display: block; white-space: pre;">>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]</code>

Using cross-validation

The following splits the data into a training set and a test set by hand. Enter the code in the console to try it:

<code class="hljs python" style="display: block; white-space: pre;">>>> import numpy as np
>>> from sklearn import model_selection
>>> from sklearn import datasets
>>> from sklearn import svm
>>> iris = datasets.load_iris()
>>> iris.data.shape, iris.target.shape
((150, 4), (150,))
>>> X_train, X_test, y_train, y_test = model_selection.train_test_split(
...     iris.data, iris.target, test_size=0.4, random_state=0)
>>> X_train.shape, y_train.shape
((90, 4), (90,))
>>> X_test.shape, y_test.shape
((60, 4), (60,))
>>> clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
>>> clf.score(X_test, y_test)
0.96...</code>

Here is a cross-validation example:

<code class="hljs python" style="display: block; white-space: pre;">>>> from sklearn import model_selection
>>> clf = svm.SVC(kernel='linear', C=1)
>>> scores = model_selection.cross_val_score(
...     clf, iris.data, iris.target, cv=5)
...
>>> scores
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])</code>

Calling cross_val_score with cv=5 runs 5-fold cross-validation and returns scores, an array holding the accuracy obtained on each of the five folds.

<code class="hljs python" style="display: block; white-space: pre;">>>> print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
Accuracy: 0.98 (+/- 0.03)</code>

scores.mean() gives the mean accuracy over the folds, and twice scores.std() is printed as a rough indication of its spread. A different metric can be selected with the scoring parameter:

<code class="hljs python" style="display: block; white-space: pre;">>>> from sklearn import model_selection
>>> scores = model_selection.cross_val_score(clf, iris.data, iris.target,
...     cv=5, scoring='f1_weighted')
>>> scores
array([ 0.96...,  1.  ...,  0.96...,  0.96...,  1.        ])</code>
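When several metrics are of interest at once, recent scikit-learn releases also offer cross_validate, which accepts a list of scoring names. The sketch below is an addition to the original post and assumes such a release; the choice of 'accuracy' and 'f1_weighted' simply mirrors the two metrics used so far.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_validate

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

results = cross_validate(clf, iris.data, iris.target, cv=5,
                         scoring=['accuracy', 'f1_weighted'])
# results is a dict; each requested metric appears under 'test_<name>',
# next to timing entries such as 'fit_time' and 'score_time'.
for name in ('test_accuracy', 'test_f1_weighted'):
    print(name, results[name].mean())
```

This fits each fold once and scores it with every metric, instead of rerunning cross_val_score per metric.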

Loading data in libsvm format:

<code class="hljs python" style="display: block; white-space: pre;">>>> from sklearn.datasets import load_svmlight_file
>>> X_train, y_train = load_svmlight_file("/path/to/train_dataset.txt")
...
>>> X_train.todense()  # convert the sparse matrix to a dense feature matrix</code>

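Since the path above is only a placeholder, a self-contained way to try the loader is to write a tiny file first and read it back. The sketch below assumes a recent scikit-learn; the toy matrix and the file name toy.txt are made up for illustration, and toarray() is used in place of todense() so that the result is a plain NumPy array.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 4.0]])
y = np.array([0, 1])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.txt")  # hypothetical file name
    # Write the data in libsvm/svmlight text format (zero entries are skipped).
    dump_svmlight_file(X, y, path, zero_based=True)
    # The loader returns a SciPy sparse matrix and a label array.
    X_sparse, y_loaded = load_svmlight_file(path, zero_based=True)

X_dense = X_sparse.toarray()  # sparse matrix -> dense ndarray
print(X_dense)
print(y_loaded)
```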
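2. Handling imbalanced classes

When the ratio of positive to negative samples is heavily skewed, one way to tune a classifier is to modify its training data, either by undersampling or by oversampling. Oversampling duplicates examples, while undersampling deletes them. Oversampling can lead to overfitting; undersampling may discard examples that carry important information, which can lead to underfitting. When positive examples are scarce, a common approach is to combine undersampling of the negative class with oversampling of the positive class.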
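A mixed resampling scheme of this kind can be sketched with plain NumPy. Everything specific here is an assumption made for illustration: the synthetic data with 10 positive and 90 negative examples, the per-class target size of 30, and the variable names. When combined with cross-validation, resampling is normally applied to the training part of each split only, so that duplicated examples do not leak into the test part.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))        # synthetic features
y = np.array([1] * 10 + [0] * 90)    # 10 positives, 90 negatives

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)
target = 30  # hypothetical number of examples to keep per class

# Oversampling: draw positives with replacement, so examples are duplicated.
over_idx = rng.choice(pos_idx, size=target, replace=True)
# Undersampling: draw negatives without replacement, so the rest are dropped.
under_idx = rng.choice(neg_idx, size=target, replace=False)

idx = np.concatenate([over_idx, under_idx])
rng.shuffle(idx)  # mix the two classes
X_bal, y_bal = X[idx], y[idx]
print(np.bincount(y_bal))  # class counts after resampling
```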

3. Learning SVM with scikit-learn

<code class="hljs r has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-top-left-radius: 0px; border-top-right-radius: 0px; border-bottom-right-radius: 0px; border-bottom-left-radius: 0px; word-wrap: normal; background: transparent;">>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> digits = datasets.load_digits()
>>> print(digits.data)
[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ..., 
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]
>>> digits.target
array([0, 1, 2, ..., 8, 9, 8])
>>> digits.images[0]
array([[  0.,   0.,   5.,  13.,   9.,   1.,   0.,   0.],
       [  0.,   0.,  13.,  15.,  10.,  15.,   5.,   0.],
       [  0.,   3.,  15.,   2.,   0.,  11.,   8.,   0.],
       [  0.,   4.,  12.,   0.,   0.,   8.,   8.,   0.],
       [  0.,   5.,   8.,   0.,   0.,   9.,   8.,   0.],
       [  0.,   4.,  11.,   0.,   1.,  12.,   7.,   0.],
       [  0.,   2.,  14.,   5.,  10.,  12.,   0.,   0.],
       [  0.,   0.,   6.,  13.,  10.,   0.,   0.,   0.]])
>>> from sklearn import svm
>>> clf = svm.SVC(gamma=0.001, C=100.)
>>> # Train on every sample except the last one
>>> clf.fit(digits.data[:-1], digits.target[:-1])
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
  gamma=0.001, kernel='rbf', max_iter=-1, probability=False,
  random_state=None, shrinking=True, tol=0.001, verbose=False)
>>> # predict expects a 2D array, so slice with [-1:] rather than indexing with [-1]
>>> clf.predict(digits.data[-1:])
array([8])
</code>
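The interactive session above can also be written as a standalone script. The following is a minimal sketch, assuming a recent scikit-learn release in which `cross_val_score` lives in `sklearn.model_selection` (the older `sklearn.cross_validation` module was removed in version 0.20). The variable names `last_prediction` and `scores` are chosen here for illustration only.

```python
# Sketch: the digits/SVC example as a script, followed by 5-fold cross-validation.
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.0)

# Train on all samples but the last, then predict the held-out sample.
clf.fit(digits.data[:-1], digits.target[:-1])
last_prediction = clf.predict(digits.data[-1:])  # 2D slice of shape (1, 64)
print("prediction for the last sample:", last_prediction)

# A single train/test split can be lucky or unlucky, so also estimate
# accuracy with 5-fold cross-validation (one score per fold).
scores = cross_val_score(clf, digits.data, digits.target, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```

Because `SVC` is a classifier, an integer `cv` should select stratified folds here, which matches the scikit-learn note quoted earlier in the article.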

3. Learning RandomForest with scikit-learn

Usage example

<code class="hljs python has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-top-left-radius: 0px; border-top-right-radius: 0px; border-bottom-right-radius: 0px; border-bottom-left-radius: 0px; word-wrap: normal; background: transparent;">>>> from sklearn.ensemble import RandomForestClassifier
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = RandomForestClassifier(n_estimators=10)
>>> clf = clf.fit(X, Y)</code>
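To connect this back to K-fold cross-validation, here is a small sketch that fits the same kind of classifier and then scores a random forest with `cross_val_score`. The iris dataset, `random_state=0`, and the variable names are assumptions made for illustration and are not part of the original article. The import path `sklearn.model_selection` assumes scikit-learn 0.18 or later.

```python
# Sketch: fit a random forest on toy data, then cross-validate one on iris.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data from the usage example: two samples, two classes.
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, Y)
toy_pred = clf.predict(X)  # one predicted label per input row
print("toy predictions:", toy_pred)

# Two samples are too few to cross-validate, so use iris (an assumption here)
# to show 5-fold cross-validation of a random forest.
iris = load_iris()
forest = RandomForestClassifier(n_estimators=10, random_state=0)
scores = cross_val_score(forest, iris.data, iris.target, cv=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", scores.mean())
```

Fixing `random_state` makes the folds' scores reproducible between runs, since each tree in the forest is otherwise built from a different random bootstrap sample.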

Method

[Figure: image not preserved in the source]

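Default constructor parameters of RandomForestClassifier (from the older scikit-learn release this article was written against; newer releases change some defaults, for example n_estimators defaults to 100 since version 0.22):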

<code class="hljs python has-numbering" style="display: block; padding: 0px; color: inherit; box-sizing: border-box; font-family: 'Source Code Pro', monospace;font-size:undefined; white-space: pre; border-top-left-radius: 0px; border-top-right-radius: 0px; border-bottom-right-radius: 0px; border-bottom-left-radius: 0px; word-wrap: normal; background: transparent;">def __init__(self,
     n_estimators=10,
     criterion="gini",
     max_depth=None,
     min_samples_split=2,
     min_samples_leaf=1,
     min_weight_fraction_leaf=0.,
     max_features="auto",
     max_leaf_nodes=None,
     bootstrap=True,
     oob_score=False,
     n_jobs=1,
     random_state=None,
     verbose=0,
     warm_start=False,
     class_weight=None):</code>

http://www.360doc.com/content/16/0626/16/20558639_570898095.shtml
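Since constructor defaults differ between scikit-learn releases, it can help to read them from the installed version rather than rely on a listing. The sketch below uses `get_params()`, which every scikit-learn estimator provides. The explicit values passed to the second classifier are copied from the signature above purely for illustration.

```python
# Sketch: inspect the defaults of the installed RandomForestClassifier
# and override a few of them explicitly.
from sklearn.ensemble import RandomForestClassifier

# get_params() returns a dict mapping each constructor parameter to its value.
defaults = RandomForestClassifier().get_params()
for name in sorted(defaults):
    print(f"{name} = {defaults[name]!r}")

# Pass parameters explicitly so behaviour does not depend on version defaults.
clf = RandomForestClassifier(
    n_estimators=10,        # number of trees in the forest
    criterion="gini",       # split quality measure
    max_depth=None,         # grow trees until leaves are pure or too small
    min_samples_split=2,
    min_samples_leaf=1,
    bootstrap=True,         # each tree sees a bootstrap sample of the data
    oob_score=False,
    random_state=None,
)
print(clf.get_params()["n_estimators"])
```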