An External Evaluation Metric for Clustering: Purity, with Python and MATLAB Implementations

阿新 • Published: 2020-12-31

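0. Preface

Part of my project involves evaluating the quality of clustering results, and many papers use "accuracy" for this. I have always been skeptical, because I could not find "accuracy" defined for clustering in the relevant textbooks; it is only used for classification. To evaluate classification results in Python, you can simply call accuracy_score from the sklearn library to get the accuracy.
So how should the "accuracy" of a clustering be defined and computed? Suppose there are 5 labeled objects whose labels are y_true=[0,0,0,1,1], and the clustering algorithm outputs y_pre=[1,1,1,0,0]. This clustering is completely correct, because the algorithm puts the first three objects in one cluster and the last two in another; only the cluster names differ. If the algorithm instead outputs y_pre=[1,1,1,0,-1], it has clearly assigned the last object incorrectly, and the "accuracy" is 0.8.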
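To make the naming problem concrete, here is a minimal sketch that compares the two label lists position by position, which is the same fraction that accuracy_score reports by default. The helper `label_accuracy` and the `relabel` mapping are hypothetical names introduced only for this illustration.

```python
def label_accuracy(y_true, y_pred):
    """Fraction of positions where the two label lists agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [0, 0, 0, 1, 1]
y_pre = [1, 1, 1, 0, 0]

# Position-wise comparison: every cluster name differs from the class label.
raw = label_accuracy(y_true, y_pre)

# Renaming the clusters (1 -> 0, 0 -> 1) leaves the grouping unchanged.
relabel = {1: 0, 0: 1}
renamed = label_accuracy(y_true, [relabel[c] for c in y_pre])

print(raw, renamed)  # 0.0 1.0
```

The grouping is identical in both cases, yet the position-wise score jumps from 0.0 to 1.0, so a clustering metric has to be independent of how the clusters happen to be named.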

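1. Purity

After digging through the literature more carefully, I found that there is an evaluation metric for clustering called purity.
Here is an example from the literature. Suppose the clustering result is as shown in the figure below: the algorithm has divided the samples into 3 clusters, cluster 1, 2 and 3. x is the most frequent in cluster 1, so cluster 1 is treated as the cluster of x. o is the most frequent in cluster 2, so it is treated as the cluster of o. ◇ is the most frequent in cluster 3, so it is treated as the cluster of ◇. Cluster 1 contains 5 x, cluster 2 contains 4 o, cluster 3 contains 3 ◇, and there are 17 samples in total.
The purity of this clustering is therefore $Purity=\frac{5+4+3}{17}\approx 0.71$.

[Figure: the clustering result, with three clusters of x, o and ◇ samples]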

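The formula for purity is:

$$Purity=\sum_{i=1}^{k}\frac{m_i}{m}p_i$$

where $k$ is the number of clusters, $m$ is the total number of samples, $m_i$ is the number of samples in cluster $i$, and $p_i$ is the fraction of samples in cluster $i$ that belong to its majority class. The product $m_i p_i$ is therefore the majority count of cluster $i$ (5, 4 and 3 in the example above).

As you can see, purity is the so-called "accuracy" I was looking for in the preface.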
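The formula can be checked against the 17-sample example with a few lines of standard-library Python. This is only a sketch: the function name `purity` is hypothetical, and the minority counts in each cluster are assumed for illustration, since the text states only the majority counts (5, 4, 3) and the total (17).

```python
from collections import Counter, defaultdict

def purity(y_true, y_pred):
    """Purity = sum over clusters of (m_i / m) * p_i."""
    clusters = defaultdict(list)
    for label, cluster in zip(y_true, y_pred):
        clusters[cluster].append(label)
    m = len(y_true)
    total = 0.0
    for members in clusters.values():
        m_i = len(members)
        # p_i: share of the most frequent true label inside this cluster
        p_i = Counter(members).most_common(1)[0][1] / m_i
        total += (m_i / m) * p_i
    return total

# Assumed cluster contents: only the majority counts and the total of 17
# come from the example; the minority counts are made up.
y_true = ["x"] * 5 + ["o"] + ["x"] + ["o"] * 4 + ["d"] + ["x"] * 2 + ["d"] * 3
y_pred = [1] * 6 + [2] * 6 + [3] * 5

print(round(purity(y_true, y_pred), 2))  # 0.71
```

Only the majority counts and the total affect the result, so any other split of the remaining five samples across the clusters gives the same purity as long as the majorities stay the same.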

2. Python implementation of purity

This is mainly taken from: https://cloud.tencent.com/developer/ask/189986

from sklearn.metrics import accuracy_score
import numpy as np

def purity_score(y_true, y_pred):
    """Purity score
        Args:
            y_true(np.ndarray): n*1 matrix Ground truth labels
            y_pred(np.ndarray): n*1 matrix Predicted clusters

        Returns:
            float: Purity score
    """
    # Flatten to 1-D; nothing below modifies the caller's arrays
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # Ordering labels
    ## Labels might be missing, e.g. a set like 0,2 where 1 is missing.
    ## np.unique with return_inverse maps the labels to the ordered set
    ## 0..n_classes-1 (0,2 becomes 0,1) without in-place reassignment,
    ## which could merge classes when negative labels are present.
    labels, y_true = np.unique(y_true, return_inverse=True)
    # Array which will hold the majority-voted labels
    y_voted_labels = np.zeros(y_true.shape, dtype=y_true.dtype)
    # Bin edges 0, 1, ..., n_classes: bin i counts the occurrences of class i
    # (bins are half-open [i, i+1[, except the last one, which is closed)
    bins = np.arange(labels.shape[0] + 1)

    for cluster in np.unique(y_pred):
        hist, _ = np.histogram(y_true[y_pred==cluster], bins=bins)
        # Find the most present label in the cluster
        winner = np.argmax(hist)
        y_voted_labels[y_pred==cluster] = winner

    return accuracy_score(y_true, y_voted_labels)

Note: the inputs y_true and y_pred of purity_score() should both be NumPy arrays.

Test code:

y_true = np.array([0, 0, 0, 1, 1, 1, 2])
y_pre = np.array([1, 1, 1, 2, 2, 2, 2])

print("Purity:", purity_score(y_true, y_pre))

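Test result:

The purity is 6/7 ≈ 0.857, since 6 of the 7 samples carry the majority label of their cluster. That is really great!!!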

3. MATLAB code

This is taken from a blog post.

function [FMeasure,Accuracy] = Fmeasure(P,C)
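% P: manually labeled clusters (ground truth), a 1-by-N row vector
% C: clusters computed by the clustering algorithm, a 1-by-N row vector
N = length(C);% total number of samples
p = unique(P);
c = unique(C);
P_size = length(p);% number of manually labeled clusters
C_size = length(c);% number of clusters found by the algorithm
% Pid, Cid: indicator matrices; a nonzero entry in row i marks a sample that belongs to the i-th cluster
Pid = double(ones(P_size,1)*P == p'*ones(1,N) );
Cid = double(ones(C_size,1)*C == c'*ones(1,N) );
CP = Cid*Pid';% intersection of C and P: CP(i,j) counts the samples in cluster i of C and class j of P
Pj = sum(CP,1);% row vector: size of each labeled class in P
Ci = sum(CP,2);% column vector: size of each cluster in C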
 
precision = CP./( Ci*ones(1,P_size) );
recall = CP./( ones(C_size,1)*Pj );
F = 2*precision.*recall./(precision+recall);
% Combine into a single overall F value (class-size-weighted best F per class)
FMeasure = sum( (Pj./sum(Pj)).*max(F) );
Accuracy = sum(max(CP,[],2))/N;
end
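For readers who prefer Python, the two quantities returned by the MATLAB function can be sketched with the standard library alone. The function name `f_measure_and_purity` is hypothetical, and this is an illustrative re-expression of the computation above under the assumption that P and C are equal-length label sequences. It has not been checked against MATLAB output.

```python
from collections import Counter

def f_measure_and_purity(P, C):
    """P: ground-truth labels, C: cluster assignments (same length)."""
    n = len(C)
    pair_counts = Counter(zip(C, P))   # CP(i, j): size of cluster i ∩ class j
    cluster_sizes = Counter(C)         # Ci
    class_sizes = Counter(P)           # Pj

    # Accuracy in the MATLAB code: each cluster contributes its largest overlap.
    purity = sum(
        max(pair_counts[(c, p)] for p in class_sizes) for c in cluster_sizes
    ) / n

    # Overall F: for each class take the best F over clusters, weighted by class size.
    f_measure = 0.0
    for p, p_size in class_sizes.items():
        best_f = 0.0
        for c, c_size in cluster_sizes.items():
            overlap = pair_counts[(c, p)]
            if overlap == 0:
                continue  # precision and recall are both 0 for this pair
            precision = overlap / c_size
            recall = overlap / p_size
            best_f = max(best_f, 2 * precision * recall / (precision + recall))
        f_measure += (p_size / n) * best_f
    return f_measure, purity

y_true = [0, 0, 0, 1, 1, 1, 2]
y_pre = [1, 1, 1, 2, 2, 2, 2]
f_value, acc = f_measure_and_purity(y_true, y_pre)
print(round(f_value, 4), round(acc, 4))  # 0.8531 0.8571
```

On the test data from section 2, the second value equals the purity 6/7, which matches the Accuracy line of the MATLAB code: the sum of each cluster's largest overlap divided by N.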


4. More evaluation metrics

For more external evaluation metrics for clustering, see the referenced blog post.
