Hierarchical Clustering in Python Using Dendrograms and Cophenetic Correlation

Introduction

In this article, we will take a look at an alternative to K Means clustering, popularly known as hierarchical clustering. Hierarchical clustering differs from K Means and K Modes in the underlying mechanism used to form clusters: K Means relies on a combination of centroids and Euclidean distance, whereas hierarchical clustering uses agglomerative or divisive techniques. Hierarchical clustering also allows clusters to be visualized using dendrograms, which can help in better interpretation of results through meaningful taxonomies. Creating a dendrogram doesn’t require us to specify the number of clusters upfront.

Programming languages like R, Python, and SAS allow hierarchical clustering to work with categorical data, making it easier to handle problem statements involving categorical variables.

Important Terms in Hierarchical Clustering

Linkage Methods

Suppose cluster (a) contains original observations a[0], …, a[|a|−1] and cluster (b) contains original observations b[0], …, b[|b|−1]. In order to combine these clusters, we need to calculate the distance between clusters (a) and (b). Now say a point (d) exists that hasn’t been allocated to any of the clusters; we then need to compute the distance from cluster (a) to (d) and from cluster (b) to (d).

Now, clusters usually contain multiple points, which requires a different approach for the distance matrix calculation. Linkage decides how the distance between two clusters, or between a point and a cluster, is computed. Commonly used linkage mechanisms are outlined below:

  1. Single Linkage — Distances between the most similar members for each pair of clusters are calculated and then clusters are merged based on the shortest distance

  2. Average Linkage — Distances between every member of one cluster and every member of another cluster are calculated. The average of these distances is then utilized to decide which clusters will merge

  3. Complete Linkage — Distances between the most dissimilar members for each pair of clusters are calculated and then clusters are merged based on the shortest of these distances

  4. Median Linkage — Similar to the average linkage, but instead of using the average distance, we utilize the median distance

  5. Ward Linkage — Uses the analysis of variance method to determine the distance between clusters

  6. Centroid Linkage — Calculates the centroid of each cluster by taking the average of all points assigned to the cluster and then calculates the distance to other clusters using this centroid

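Each of these linkage methods maps directly onto the `method` argument of SciPy's `linkage` function. A minimal sketch on synthetic data (the array `X` here is made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 2))  # 20 synthetic 2-D observations

for method in ["single", "average", "complete", "median", "ward", "centroid"]:
    Z = linkage(X, method=method)
    # Each of the n-1 rows of Z records one merge:
    # [cluster_i, cluster_j, merge_distance, new_cluster_size]
    print(method, Z.shape)
```

Note that SciPy restricts the median, centroid, and Ward methods to raw observations with the Euclidean metric, since their update formulas are only defined in Euclidean space.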

These formulas for distance calculation are illustrated in Figure 1 below.

Figure 1. Distance formulas for the linkages mentioned above. Image credit: developed by the author.

Distance Calculation

Distance between two or more clusters can be calculated using multiple approaches, the most popular being Euclidean distance. However, other distance metrics such as Minkowski, city block (Manhattan), Hamming, Jaccard, and Chebyshev can also be used with hierarchical clustering. Figure 2 below outlines how hierarchical clustering is influenced by different distance metrics.

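In SciPy, the distance metric is passed through the `metric` argument of `linkage`; any metric accepted by `scipy.spatial.distance.pdist` works. A small sketch on made-up data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 3))  # synthetic observations

# Average linkage accepts any pdist metric; median/centroid/ward
# would require metric="euclidean".
for metric in ["euclidean", "cityblock", "chebyshev", "minkowski"]:
    Z = linkage(X, method="average", metric=metric)
    print(metric, round(Z[-1, 2], 3))  # distance of the final merge
```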

Figure 2. Impact of distance calculation and linkage on cluster formation. Image credit: GIF via Gfycat.

Dendrogram

A dendrogram represents the relationships between objects in a feature space by displaying the distance at which each pair of objects or clusters is sequentially merged. Dendrograms are commonly used to study hierarchical clusters before deciding the number of clusters appropriate for the dataset. The distance at which two clusters combine is referred to as the dendrogram distance; it is a measure of whether two or more clusters are disjoint or can be combined to form one cluster.

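Dendrograms like those in Figures 3 through 5 can be produced with SciPy's `dendrogram` function. A minimal sketch on synthetic data (here with `no_plot=True` to inspect the tree layout; calling `dendrogram(Z)` inside a matplotlib figure renders the actual plot):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))  # synthetic observations

Z = linkage(X, method="median")
# no_plot=True returns the layout dict without drawing anything
tree = dendrogram(Z, no_plot=True)
print(len(tree["leaves"]))  # one leaf per original observation
```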

Figure 3. Dendrogram of a hierarchical clustering using median as the linkage type. Image credit: developed by the author using Jupyter Notebook.
Figure 4. Dendrogram of a hierarchical clustering using average as the linkage type. Image credit: developed by the author using Jupyter Notebook.
Figure 5. Dendrogram of a hierarchical clustering using complete as the linkage type. Image credit: developed by the author using Jupyter Notebook.

Cophenetic Coefficient

Figures 3, 4, and 5 above show how the choice of linkage impacts cluster formation. Visually inspecting every dendrogram to determine which linkage works best is challenging and requires a lot of manual effort. To overcome this, we introduce the concept of the cophenetic coefficient.

Imagine two clusters, A and B, with points A₁, A₂, and A₃ in cluster A and points B₁, B₂, and B₃ in cluster B. For these two clusters to be well separated, points A₁, A₂, and A₃ should also be far from points B₁, B₂, and B₃. The cophenetic coefficient is a measure of the correlation between the distance of points in feature space and their distance on the dendrogram. It takes all possible pairs of points in the data and calculates the Euclidean distance between them (which remains the same irrespective of the linkage algorithm we choose), then computes the dendrogram distance at which each pair of clusters combines. If the distance between points grows in step with the dendrogram distance between their clusters, the cophenetic coefficient is closer to 1.

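SciPy exposes this measure as `scipy.cluster.hierarchy.cophenet`, which correlates the pairwise feature-space distances with the dendrogram distances. A sketch on synthetic data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))  # synthetic observations

pairwise = pdist(X)  # Euclidean distances; fixed regardless of linkage
for method in ["single", "average", "complete", "median", "ward", "centroid"]:
    c, _ = cophenet(linkage(X, method=method), pairwise)
    print(f"{method:9s} cophenetic coefficient = {c:.3f}")
```

The linkage whose coefficient is closest to 1 preserves the original pairwise distances best, which is how a comparison like Figure 6 ranks the methods.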

Figure 6. Cophenetic coefficient of different linkage methods in hierarchical clustering. Image credit: developed by the author using Jupyter Notebook.

Deciding the Number of Clusters

There are no statistical techniques to decide the number of clusters in hierarchical clustering, unlike a K Means algorithm that uses an elbow plot to determine the number of clusters. However, one common approach is to analyze the dendrogram and look for groups that combine at a higher dendrogram distance. Let’s take a look at the example below.

Figure 7. Dendrogram of a hierarchical clustering using the average linkage method. Image credit: developed by the author using Jupyter Notebook.

Figure 7 illustrates the presence of 5 clusters when the tree is cut at a dendrogram distance of 3. The general idea is that all 5 groups of clusters combine only at a much higher dendrogram distance and can hence be treated as individual groups for this analysis. We can also verify this using a silhouette score.

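Cutting the tree at a chosen dendrogram distance is done with SciPy's `fcluster`, and the resulting labels can be checked with a silhouette score. A sketch, assuming scikit-learn is available for the synthetic blobs and the score (the threshold of 3 matches Figure 7 but is data-dependent, not universal):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=100, centers=5, random_state=3)
Z = linkage(X, method="average")

# criterion="distance" cuts the dendrogram at height t
labels = fcluster(Z, t=3, criterion="distance")
sil = silhouette_score(X, labels)
print("clusters:", len(np.unique(labels)), "silhouette:", round(sil, 3))
```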

Conclusion

Deciding the number of clusters in any clustering exercise is a tedious task. Since the commercial side of the business is more focused on deriving meaning from these groups, it is important to visualize the clusters in a two-dimensional space and check whether they are distinct from each other. This can be achieved via PCA or factor analysis, and it is a widely used way to present the final results to different stakeholders, making it easier for everyone to consume the output.

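A two-dimensional view like Figure 8 can be sketched by projecting the data with PCA and colouring points by cluster label (scikit-learn assumed; the blob data is synthetic):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=100, centers=4, n_features=6, random_state=4)
labels = fcluster(linkage(X, method="ward"), t=4, criterion="maxclust")

coords = PCA(n_components=2).fit_transform(X)  # project to 2-D
# e.g. plt.scatter(coords[:, 0], coords[:, 1], c=labels)
print(coords.shape)
```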

Figure 8. Cluster visuals of a hierarchical clustering using two different linkage techniques. Image credit: developed by the author using Jupyter Notebook.

About the author: Advanced analytics professional and management consultant helping companies find solutions to diverse problems through a mix of business, technology, and math applied to organizational data. A data science enthusiast, here to share, learn, and contribute; you can connect with me on LinkedIn and Twitter.

Translated from: https://towardsdatascience.com/hierarchical-clustering-in-python-using-dendrogram-and-cophenetic-correlation-8d41a08f7eab