
Evaluation methods for sklearn clustering algorithms: the various indices

Clustering quality in Python

Some of this content comes from: 機器學習評價指標大彙總 (a roundup of machine-learning evaluation metrics).
The three indices I personally prefer are: the Calinski-Harabasz Index (model evaluation when the true labels are unknown), Homogeneity/completeness/V-measure (when the number of clusters is of interest), and the Silhouette Coefficient.

1.1 Adjusted Rand Index

The adjusted Rand index corrects the Rand index (RI) for chance agreement:

ARI = (RI - E[RI]) / (max(RI) - E[RI])

It is bounded above by 1.0 (perfect agreement) and is close to 0.0 for random labelings.

>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]

>>> metrics.adjusted_rand_score(labels_true, labels_pred)
0.24...

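Beyond the headline number, two handy properties of ARI can be checked directly: it is symmetric in its arguments, and it is invariant to renaming the cluster ids. A quick sketch:

```python
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# The score is symmetric in its two arguments.
a = metrics.adjusted_rand_score(labels_true, labels_pred)
b = metrics.adjusted_rand_score(labels_pred, labels_true)

# Renaming cluster ids (0 -> 2, 1 -> 0, 2 -> 1) describes the same
# partition, so the score is unchanged.
relabelled = [2, 2, 0, 0, 1, 1]
c = metrics.adjusted_rand_score(labels_true, relabelled)

# A perfect clustering scores exactly 1.0.
perfect = metrics.adjusted_rand_score(labels_true, labels_true)
```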

1.2 Mutual Information Based Scores

Mutual information measures the agreement of the two assignments, ignoring permutations of the labels:

MI(U, V) = Σ_i Σ_j P(i, j) · log( P(i, j) / (P(i) · P'(j)) )
Two different normalized versions of this measure are available, Normalized Mutual Information(NMI) and Adjusted Mutual Information(AMI). NMI is often used in the literature while AMI was proposed more recently and is normalized against chance:

>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]

>>> metrics.adjusted_mutual_info_score(labels_true, labels_pred)  
0.22504

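A quick way to see the chance adjustment: on the labels above, AMI comes out strictly lower than NMI, while a perfect labeling scores 1.0 under both. Exact values vary slightly across sklearn versions and their default averaging methods, so this sketch only checks the ordering:

```python
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

nmi = metrics.normalized_mutual_info_score(labels_true, labels_pred)
ami = metrics.adjusted_mutual_info_score(labels_true, labels_pred)

# NMI does not correct for chance, so it is the larger of the two here,
# while both reach 1.0 for a perfect (up to relabelling) clustering.
nmi_perfect = metrics.normalized_mutual_info_score(labels_true, labels_true)
ami_perfect = metrics.adjusted_mutual_info_score(labels_true, labels_true)
```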

1.3 Homogeneity, completeness and V-measure

Homogeneity: each cluster contains only members of a single class.
Completeness: all members of a given class are assigned to the same cluster.

>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]

>>> metrics.homogeneity_score(labels_true, labels_pred)  
0.66...

>>> metrics.completeness_score(labels_true, labels_pred) 
0.42...

The V-measure is the harmonic mean of the two:

>>> metrics.v_measure_score(labels_true, labels_pred)    
0.51...
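With its default beta = 1, the V-measure is exactly the harmonic mean of homogeneity and completeness, which is easy to verify:

```python
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

h = metrics.homogeneity_score(labels_true, labels_pred)
c = metrics.completeness_score(labels_true, labels_pred)
v = metrics.v_measure_score(labels_true, labels_pred)

# harmonic mean: v = 2*h*c / (h + c)
harmonic = 2 * h * c / (h + c)
```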


1.4 Fowlkes-Mallows scores

The Fowlkes-Mallows score FMI is defined as the geometric mean of the pairwise precision and recall:
FMI = TP / sqrt((TP + FP) × (TP + FN))

where TP is the number of pairs of points grouped together in both the true and predicted labelings, FP the pairs grouped together only in the prediction, and FN the pairs grouped together only in the ground truth.

>>> from sklearn import metrics
>>> labels_true = [0, 0, 0, 1, 1, 1]
>>> labels_pred = [0, 0, 1, 1, 2, 2]
>>>
>>> metrics.fowlkes_mallows_score(labels_true, labels_pred)  
0.47140...
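The pairwise definition can be verified by counting the sample pairs by hand:

```python
from itertools import combinations
from math import sqrt
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

tp = fp = fn = 0
for i, j in combinations(range(len(labels_true)), 2):
    same_true = labels_true[i] == labels_true[j]
    same_pred = labels_pred[i] == labels_pred[j]
    if same_true and same_pred:
        tp += 1  # grouped together in both labelings
    elif same_pred:
        fp += 1  # together only in the prediction
    elif same_true:
        fn += 1  # together only in the ground truth

# geometric mean of pairwise precision tp/(tp+fp) and recall tp/(tp+fn)
fmi_manual = tp / sqrt((tp + fp) * (tp + fn))
```

For these labels tp = 2, fp = 1, fn = 4, giving 2 / sqrt(3 × 6) ≈ 0.4714, matching the library value.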


1.5 Silhouette Coefficient

For a single sample, let a be the mean distance to the other points in its own cluster and b the mean distance to the points in the nearest other cluster. Then:

s = (b - a) / max(a, b)

The score lies in [-1, 1]; higher values indicate denser, better-separated clusters. The dataset-level score reported below is the mean of s over all samples.

>>> from sklearn import metrics
>>> from sklearn.cluster import KMeans
>>> from sklearn.datasets import load_iris
>>> X = load_iris().data
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.silhouette_score(X, labels, metric='euclidean')
0.55...

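Since higher is better, a common pattern is to sweep over candidate cluster counts and keep the k with the best silhouette. A sketch, using the Iris data as an assumed example dataset:

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data  # example feature matrix

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=1, n_init=10).fit_predict(X)
    scores[k] = metrics.silhouette_score(X, labels, metric='euclidean')

# Scores always lie in [-1, 1]; pick the k with the highest score.
best_k = max(scores, key=scores.get)
```

On Iris the silhouette actually favors k = 2 over the three true species, a reminder that the score measures geometric separation, not agreement with known classes.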

1.6 Calinski-Harabasz Index

This index is simple and direct to compute: the larger the Calinski-Harabasz score s, the better the clustering. Its formula is (theory from: 用scikit-learn學習K-Means聚類):

s(k) = [ tr(B_k) / tr(W_k) ] × [ (m - k) / (k - 1) ]

where m is the number of samples, k the number of clusters, B_k the between-group dispersion matrix, and W_k the within-cluster dispersion matrix.
In other words, the smaller the within-cluster covariance and the larger the between-cluster covariance, the higher the Calinski-Harabasz score.
In scikit-learn the corresponding function is metrics.calinski_harabaz_score (renamed to metrics.calinski_harabasz_score in later versions).
When the true cluster labels are unknown, it can serve as one way to evaluate a model.
Conversely, a small value can be read as: the between-cluster covariance is small, and the boundaries between clusters are indistinct.
Compared with the silhouette coefficient, its biggest advantage in my experience is speed: hundreds of times faster, at the millisecond level!

>>> from sklearn import metrics
>>> from sklearn.cluster import KMeans
>>> from sklearn.datasets import load_iris
>>> X = load_iris().data
>>> kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
>>> labels = kmeans_model.labels_
>>> metrics.calinski_harabasz_score(X, labels)
560.39...
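As a sanity check, the score can be recomputed by hand from the traces of the between- and within-cluster dispersion matrices, again using Iris as an assumed example dataset (and the current function name, metrics.calinski_harabasz_score):

```python
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data  # example feature matrix
labels = KMeans(n_clusters=3, random_state=1, n_init=10).fit_predict(X)

m, k = X.shape[0], 3
overall_mean = X.mean(axis=0)
tr_b = 0.0  # trace of the between-group dispersion matrix B_k
tr_w = 0.0  # trace of the within-cluster dispersion matrix W_k
for c in range(k):
    Xc = X[labels == c]
    centroid = Xc.mean(axis=0)
    tr_b += len(Xc) * ((centroid - overall_mean) ** 2).sum()
    tr_w += ((Xc - centroid) ** 2).sum()

# s(k) = [tr(B_k) / tr(W_k)] * [(m - k) / (k - 1)]
ch_manual = (tr_b / tr_w) * ((m - k) / (k - 1))
```

The manual value agrees with the library function, confirming the formula above.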