Python sklearn Study Notes

阿新 • Published: 2019-01-05

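Preface: these are my study notes.

Introduction to sklearn

scikit-learn is a simple and efficient tool for data mining and data analysis.
It is built on NumPy, SciPy, and matplotlib.

It mainly covers the following areas.

By functionality:
- Classification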
- Regression
- Clustering
- Dimensionality reduction
- Model selection
- Preprocessing
By API module:
- sklearn.base: Base classes and utility functions
- sklearn.cluster: Clustering
- sklearn.cluster.bicluster: Biclustering
- sklearn.covariance: Covariance Estimators
- sklearn.model_selection: Model Selection
- sklearn.datasets: Datasets
- sklearn.decomposition: Matrix Decomposition
- sklearn.dummy: Dummy estimators
- sklearn.ensemble: Ensemble Methods
- sklearn.exceptions: Exceptions and warnings
- sklearn.feature_extraction: Feature Extraction
- sklearn.feature_selection: Feature Selection
- sklearn.gaussian_process: Gaussian Processes
- sklearn.isotonic: Isotonic regression
- sklearn.kernel_approximation: Kernel Approximation
- sklearn.kernel_ridge: Kernel Ridge Regression
- sklearn.discriminant_analysis: Discriminant Analysis
- sklearn.linear_model: Generalized Linear Models
- sklearn.manifold: Manifold Learning
- sklearn.metrics: Metrics
- sklearn.mixture: Gaussian Mixture Models
- sklearn.multiclass: Multiclass and multilabel classification
- sklearn.multioutput: Multioutput regression and classification
- sklearn.naive_bayes: Naive Bayes
- sklearn.neighbors: Nearest Neighbors
- sklearn.neural_network: Neural network models
- sklearn.calibration: Probability Calibration
- sklearn.cross_decomposition: Cross decomposition
- sklearn.pipeline: Pipeline
- sklearn.preprocessing: Preprocessing and Normalization
- sklearn.random_projection: Random projection
- sklearn.semi_supervised: Semi-Supervised Learning
- sklearn.svm: Support Vector Machines
- sklearn.tree: Decision Tree
- sklearn.utils: Utilities

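As a beginner, I find that the parts I use most often are clustering, classification (SVM, decision trees, linear models, etc.), decomposition, preprocessing, and metrics, so I will start with those.

cluster

Reading the sklearn.cluster API shows that it contains two kinds of things: classes for the various clustering methods, such as cluster.KMeans, and functions that run a clustering method directly, such as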

sklearn.cluster.k_means(X, n_clusters, init='k-means++', 
    precompute_distances='auto', n_init=10, max_iter=300, 
    verbose=False, tol=0.0001, random_state=None, 
    copy_x=True, n_jobs=1, algorithm='auto', return_n_iter=False)

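So in practice there are two corresponding ways to use them.

sklearn.cluster provides nine clustering methods:

AffinityPropagation: affinity propagation
AgglomerativeClustering: hierarchical (agglomerative) clustering
Birch
DBSCAN
FeatureAgglomeration: feature agglomeration
KMeans: k-means clustering
MiniBatchKMeans
MeanShift
SpectralClustering: spectral clustering

Let's take the familiar KMeans as an example.

Building a KMeans clusterer with the class constructor

According to the API, the KMeans constructor is: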

sklearn.cluster.KMeans(n_clusters=8,
     init='k-means++', 
    n_init=10, 
    max_iter=300, 
    tol=0.0001, 
    precompute_distances='auto', 
    verbose=0, 
    random_state=None, 
    copy_x=True, 
    n_jobs=1, 
    algorithm='auto'
    )

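Meaning of the parameters:

n_clusters: the number of clusters, i.e. how many groups you want
init: the method used to choose the initial cluster centers
n_init: the number of times the algorithm is run with different initial centers; the best run (lowest inertia) is kept
max_iter: the maximum number of iterations in a single run (k-means is an iterative algorithm)
tol: the tolerance used to declare convergence
precompute_distances: whether to precompute distances (faster, but uses more memory)
verbose: verbosity mode, which controls how much progress output is printed; the default is usually left unchanged
random_state: the seed of the random number generator used to initialize the centers; fix it to get reproducible results
copy_x: whether to copy the input data; if True, the original data is not modified
n_jobs: the number of parallel jobs
algorithm: the k-means implementation to use: 'auto', 'full', or 'elkan', where 'full' is the classic EM-style algorithm

Note that this signature reflects the sklearn release used when these notes were written; later releases removed precompute_distances and n_jobs and renamed 'full' to 'lloyd'.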
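
To make a few of these parameters concrete, here is a minimal sketch of how random_state, n_init, and max_iter behave. The three-blob toy data and the seed values are my own assumptions for illustration, and only parameters that exist in both older and newer sklearn releases are passed.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
# Hypothetical toy data: three well-separated blobs, 50 points each, 2 features
offsets = ([0, 0], [10, 10], [-10, 10])
data = np.vstack([rng.randn(50, 2) + offset for offset in offsets])

# The same random_state gives the same initial centers, hence the same result
km_a = KMeans(n_clusters=3, n_init=10, max_iter=300, tol=1e-4, random_state=42).fit(data)
km_b = KMeans(n_clusters=3, n_init=10, max_iter=300, tol=1e-4, random_state=42).fit(data)

same_centers = np.allclose(km_a.cluster_centers_, km_b.cluster_centers_)
print(same_centers)   # whether the two runs agree
print(km_a.n_iter_)   # iterations used by the best of the n_init runs (never more than max_iter)
```

Leaving random_state at None would make each call start from different centers, which is why results can vary slightly between runs.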

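There are many parameters, but they all have default values, so we usually do not need to pass them and can set only the ones we actually need. Here is a simple example:

import numpy as np
from sklearn.cluster import KMeans
data = np.random.rand(100, 3)  # random data: 100 samples, 3 features

# Suppose we want a clusterer with 3 clusters
estimator = KMeans(n_clusters=3)  # build the clusterer
estimator.fit(data)  # run the clustering
label_pred = estimator.labels_  # cluster label of each sample
centroids = estimator.cluster_centers_  # cluster centers
inertia = estimator.inertia_  # final value of the clustering criterion (sum of squared distances)

Using the k_means function directly: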

import numpy as np
from sklearn import cluster
data = np.random.rand(100, 3)  # random data: 100 samples, 3 features
k = 3  # suppose we want 3 clusters
centroid, label, inertia = cluster.k_means(data, k)

Other methods work in a similar way; check the API for the details. (Learn to read the API, and make a habit of it.)
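
As a rough illustration of "other methods are similar", the sketch below runs two of the other clusterers listed earlier through the same fit / labels_ pattern. The two-blob data and the eps and min_samples values are assumptions chosen only to make the example work.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN

rng = np.random.RandomState(0)
# Hypothetical toy data: two tight blobs of 30 points each
data = np.vstack([rng.randn(30, 2) * 0.3 + center for center in ([0, 0], [5, 5])])

# Same pattern as KMeans: construct, fit, then read labels_
agg = AgglomerativeClustering(n_clusters=2).fit(data)
db = DBSCAN(eps=1.0, min_samples=5).fit(data)

agg_labels = agg.labels_
db_labels = db.labels_  # DBSCAN marks noise points with the label -1

print(sorted(set(agg_labels)))
print(sorted(set(db_labels)))
```

Note that DBSCAN takes no n_clusters argument; the number of clusters follows from eps and min_samples, which is the kind of detail the API page spells out.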

classification

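Classification is one of the most important parts of data mining and machine learning. However, because the classic classification methods work in quite different ways, sklearn does not seem to gather them into a single dedicated classifier module; they are spread across several modules.
Commonly used classification methods include:

KNN (k-nearest neighbors): sklearn.neighbors
Logistic regression: sklearn.linear_model.LogisticRegression
SVM (support vector machine): sklearn.svm
Naive Bayes: sklearn.naive_bayes
Decision tree: sklearn.tree
Neural network: sklearn.neural_network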
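
Although these classifiers live in different modules, they share the same fit / predict / score interface. The following sketch, which assumes the iris data and a 70/30 split purely for illustration, trains several of them in one loop. The neural network is left out only to keep the run short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Hyperparameters here are illustrative choices, not tuned values
classifiers = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logreg": LogisticRegression(max_iter=1000),
    "svm": SVC(),
    "naive_bayes": GaussianNB(),
    "tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                  # train on the training split
    scores[name] = clf.score(X_test, y_test)   # mean accuracy on the held-out split
    print(name, scores[name])
```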

Let's take KNN (mainly Nearest Neighbors Classification) as an example of how to use these methods:

import numpy as np
from sklearn import neighbors, datasets

# import some data to play with
iris = datasets.load_iris()
n_neighbors = 15
X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
y = iris.target

weights = 'distance'  # can also be set to 'uniform'
clf = neighbors.KNeighborsClassifier(n_neighbors, weights=weights)
clf.fit(X, y)

# if you have test data, just predict with the following functions
# for example, xx, yy form a grid of constructed test points
h = .02  # step size of the grid (an illustrative value)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])  # Z holds the predicted label of each grid point

Another example, SVM:

from sklearn import svm
X = [[0, 0], [1, 1]]
y = [0, 1]

# create a support vector classification model
clf = svm.SVC()

# fit the training data to obtain the model parameters
clf.fit(X, y)

# predict the test points [2., 2.] and [3., 3.]
res = clf.predict([[2., 2.], [3., 3.]])

# print the predictions
print(res)


# get support vectors
print("support vectors:", clf.support_vectors_)

# get indices of support vectors
print("indices of support vectors:", clf.support_)

# get number of support vectors for each class
print("number of support vectors for each class:", clf.n_support_)

SVM also has a corresponding regression model, SVR:

from sklearn import svm
X = [[0, 0], [2, 2]]
y = [0.5, 2.5]
clf = svm.SVR()
clf.fit(X, y)
res = clf.predict([[1, 1]])
print(res)

Logistic regression

from sklearn import linear_model
X = [[0, 0], [1, 1]]
y = [0, 1]
logreg = linear_model.LogisticRegression(C=1e5)

# create a logistic regression classifier and fit the data
logreg.fit(X, y)

res = logreg.predict([[2, 2]])
print(res)
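
Besides hard labels, a fitted logistic regression can also report class probabilities. This is a small sketch on the same two-point training set; the query points [2, 2] and [-1, -1] are my own choice.

```python
from sklearn.linear_model import LogisticRegression

X = [[0, 0], [1, 1]]
y = [0, 1]
logreg = LogisticRegression(C=1e5).fit(X, y)

# One row per query point, one column per class (in the order of classes_)
proba = logreg.predict_proba([[2, 2], [-1, -1]])
print(logreg.classes_)
print(proba)
```

Each row sums to 1, and predict() simply returns the class with the larger probability.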

preprocessing

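In this module, what I usually need is scaling. There are many kinds of scalers, including:

StandardScaler
MaxAbsScaler
MinMaxScaler
RobustScaler
Normalizer
and other preprocessing operations

There are also corresponding functions that can be called directly: scale(), maxabs_scale(), minmax_scale(), robust_scale(), normalize().

For example: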

import numpy as np
from sklearn import preprocessing
X = np.random.rand(3, 4)


# using the scaler class
scaler = preprocessing.MinMaxScaler()
X_scaled = scaler.fit_transform(X)


# using the scale function
X_scaled_convenient = preprocessing.minmax_scale(X)
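
One practical reason to prefer the scaler class over the one-shot function is that a fitted scaler can be reused on new data. The sketch below assumes a hypothetical train/test split of random data and uses StandardScaler from the list above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_train = rng.rand(20, 4) * 10  # hypothetical training data
X_test = rng.rand(5, 4) * 10    # hypothetical unseen data

scaler = StandardScaler().fit(X_train)   # learn per-column mean and std from the training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)    # apply the training statistics to the test data

print(X_train_std.mean(axis=0))  # approximately 0 for every column
print(X_train_std.std(axis=0))   # approximately 1 for every column
```

The test data is scaled with the training mean and standard deviation, so its own column means will generally not be exactly zero.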

decomposition

Let's look at NMF and PCA, the two I use most often.

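import numpy as np
X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
from sklearn.decomposition import NMF
model = NMF(n_components=2, init='random', random_state=0)
model.fit(X)

print(model.components_)
print(model.reconstruction_err_)
print(model.n_iter_)

A note on the difference between fit() and fit_transform() in this class: the former only trains the model and does not return the NMF decomposition of the data, while the latter trains the model and also returns the decomposition (the transformed data).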
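
The difference can be checked directly. In this sketch (max_iter=1000 is an assumption added only to give the solver room to converge), fit() hands back the estimator itself, while fit_transform() hands back the coefficient matrix W, with the basis H stored in components_.

```python
import numpy as np
from sklearn.decomposition import NMF

X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])

model = NMF(n_components=2, init='random', random_state=0, max_iter=1000)
returned = model.fit(X)   # fit() returns the fitted estimator, not the transformed data
W = model.transform(X)    # a separate call is needed to get the coefficients, shape (6, 2)
H = model.components_     # the learned basis, shape (2, 2)

# fit_transform() trains and returns the coefficients in one step
W2 = NMF(n_components=2, init='random', random_state=0, max_iter=1000).fit_transform(X)

print(returned is model)
print(W.shape, H.shape, W2.shape)
print(np.abs(X - W @ H).max())  # reconstruction error of X ≈ W H
```

W and W2 need not be bit-identical, since transform() solves for W with H held fixed while fit_transform() optimizes both together.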

PCA is similar, except that it has no initialization parameters:

import numpy as np
X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
from sklearn.decomposition import PCA
model = PCA(n_components=2)
model.fit(X)

print(model.components_)
print(model.n_components_)
print(model.explained_variance_)
print(model.explained_variance_ratio_)
print(model.mean_)
print(model.noise_variance_)
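
The example above keeps both components, so nothing is actually reduced. As a small sketch of dimensionality reduction proper, the following keeps a single component (my own choice) and maps the data back with inverse_transform.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])

pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)                # project onto the first principal component, shape (6, 1)
X_restored = pca.inverse_transform(X_reduced)   # map back to 2 features, shape (6, 2); this is lossy

print(X_reduced.shape, X_restored.shape)
print(pca.explained_variance_ratio_)  # share of the variance kept by the single component
```

For this data nearly all of the variance lies along the first feature, so one component retains almost everything.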

metrics

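The clustering and classification tasks above both need to be evaluated at the end.

Classification

For classification, the commonly used evaluation metrics include: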

accuracy_score
auc
f1_score
fbeta_score
hamming_loss
hinge_loss
jaccard_similarity_score (replaced by jaccard_score in later releases)
log_loss
recall_score
…

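The following example computes the accuracy of a classification result:

from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
ac = accuracy_score(y_true, y_pred)  # fraction of correct predictions
print(ac)
ac2 = accuracy_score(y_true, y_pred, normalize=False)  # number of correct predictions
print(ac2)

The other metrics are used in a similar way.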
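
To show that the other metrics follow the same (y_true, y_pred) calling pattern, here is a short sketch on a made-up binary example. precision_score is not in the list above but belongs to the same family.

```python
from sklearn.metrics import f1_score, hamming_loss, precision_score, recall_score

# Hypothetical binary labels: 3 true positives, 0 false positives, 1 false negative
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
h_loss = hamming_loss(y_true, y_pred)        # fraction of wrong labels (1 out of 6 here)

print(precision)  # 1.0
print(recall)     # 0.75
print(f1)
print(h_loss)
```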

Regression

Regression metrics include, but are not limited to, the following:

mean_absolute_error
mean_squared_error
median_absolute_error
…
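
The regression metrics listed above take the same two arguments. A minimal sketch with made-up values whose absolute errors are 0.5, 0.5, 0, and 1:

```python
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error)

# Hypothetical targets and predictions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

mae = mean_absolute_error(y_true, y_pred)      # mean of |error|
mse = mean_squared_error(y_true, y_pred)       # mean of error squared
medae = median_absolute_error(y_true, y_pred)  # median of |error|

print(mae)    # 0.5
print(mse)    # 0.375
print(medae)  # 0.5
```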

Clustering

Commonly used evaluation metrics (internal and external) include:

adjusted_mutual_info_score
adjusted_rand_score
completeness_score
homogeneity_score
normalized_mutual_info_score
silhouette_score
v_measure_score
…

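The following example computes the NMI (normalized mutual information) of a clustering result; the other metrics are used in a similar way.

from sklearn.metrics import normalized_mutual_info_score

y_pred = [0, 0, 1, 1, 2, 2]
y_true = [1, 1, 2, 2, 3, 3]

nmi = normalized_mutual_info_score(y_true, y_pred)
print(nmi)

There are many more metrics besides these; see the API.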
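
The internal versus external distinction shows up in the function arguments. External metrics such as adjusted_rand_score compare predicted labels with ground-truth labels, while internal metrics such as silhouette_score need only the data and the predicted labels. The six 2-D points below are an assumption made up for this sketch.

```python
import numpy as np
from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
                             silhouette_score)

y_true = [1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 1, 2, 2]  # same grouping as y_true, only the label names differ

# External metrics: need ground-truth labels, invariant to label renaming
ari = adjusted_rand_score(y_true, y_pred)
nmi = normalized_mutual_info_score(y_true, y_pred)

# Internal metric: needs the data itself, but no ground truth
X = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [20, 0], [20, 1]], dtype=float)
sil = silhouette_score(X, y_pred)

print(ari)  # 1.0
print(nmi)  # 1.0
print(sil)  # close to 1 because the clusters are tight and far apart
```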

datasets

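sklearn itself also ships several common datasets, such as iris, diabetes, digits, covtype, kddcup99, boston, and breast_cancer. They can be loaded with functions like sklearn.datasets.load_iris (the larger ones, such as covtype and kddcup99, use the corresponding fetch_* functions). Each call returns a dataset object, from which the data and labels are obtained as follows.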

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data 
y = iris.target

Besides these public datasets, the datasets module also provides many data-handling functions, such as load_files and load_svmlight_file, as well as many data generators.
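
As one example of a data generator, make_blobs produces labeled clusters that are handy for trying out the clustering code earlier in these notes. The sample count, feature count, and seed below are arbitrary choices.

```python
from sklearn.datasets import make_blobs

# 100 samples with 3 features, drawn around 3 cluster centers
X, y = make_blobs(n_samples=100, n_features=3, centers=3, random_state=0)

print(X.shape)         # (100, 3)
print(sorted(set(y)))  # the generated cluster labels
```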

pandas.io also provides many ways to load external data (CSV, Excel, JSON, SQL, and other formats).
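
A minimal sketch of the pandas route, assuming a small CSV with hypothetical column names; an in-memory string stands in for a real file so the example is self-contained.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a path such as "data.csv"
csv_text = "sepal_length,sepal_width,label\n5.1,3.5,0\n4.9,3.0,0\n6.3,3.3,2\n"
df = pd.read_csv(io.StringIO(csv_text))

# Split into a feature matrix and a label vector, the form sklearn estimators expect
X = df[["sepal_length", "sepal_width"]].to_numpy()
y = df["label"].to_numpy()

print(X.shape)  # (3, 2)
print(y)
```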

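You can also fetch datasets from the mldata repository (in later sklearn releases, fetch_openml takes over this role).

Python is quite powerful here.

Of course, you can also load a dataset by writing your own readfile function.

Conclusion

These notes cover the features I use most often. Once you are familiar with Python, reading the scikit-learn API is all it takes to get the rest done.

When necessary, you can also read the source code of these commonly used functions to deepen your understanding of how common data mining algorithms work.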
