K-means & HCA + the iris dataset + a Python implementation

阿新 • Published: 2018-12-31

Basic clustering algorithms

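K-means:
A prototype-based, partitional, distance-based technique that tries to find a user-specified number (K) of clusters.
a. Randomly pick k center points.
b. Go through all the data and assign each point to its nearest center.
c. Compute the mean of each cluster and use it as the new center.
d. Repeat b-c until the k centers stop changing (the algorithm has converged) or enough iterations have run.
Time complexity: O(Inkm), where I is the number of iterations, k the number of clusters, and n × m the size of the data matrix (points × attributes).
Space complexity: O(nm)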
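The four steps above can be sketched in a few lines of NumPy. This is only a minimal illustration, not what sklearn's KMeans does internally; the function name kmeans and the parameters max_iter and seed are my own choices for the example.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain K-means sketch following steps a-d; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # a. pick k distinct data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # b. assign every point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # c. move each center to the mean of its points
        #    (an empty cluster keeps its old center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # d. stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two small, well-separated groups of made-up points
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = kmeans(X, k=2)
print(labels)
```

Each pass computes n × k distances over m attributes, which is where the O(Inkm) running time comes from.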
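Hierarchical agglomerative clustering (HCA):
The main idea is to start with every sample as its own cluster, then repeatedly merge the two closest clusters (this is what "agglomerative" means) until a stopping condition is met.
a. Treat every data point in the training set as its own cluster.
b. Compute the distance between every pair of clusters and merge the two that are closest or most similar.
c. Repeat the step above until the current number of clusters is 10% of the number before merging started, i.e. 90% of the clusters have been merged. Other stopping conditions are possible; this one is chosen to prevent over-merging.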
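Steps a-c can be written out naively as below. This is a slow sketch meant only for reading, assuming average linkage as the cluster distance (the description above does not fix one); the name agglomerate and the parameter stop_fraction are made up for this example.

```python
import numpy as np

def agglomerate(X, stop_fraction=0.1):
    """Naive agglomerative clustering; returns a list of index lists."""
    # a. every point starts as its own cluster
    clusters = [[i] for i in range(len(X))]
    # stop when only stop_fraction of the initial clusters remain
    target = max(1, int(np.ceil(len(X) * stop_fraction)))
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    while len(clusters) > target:
        # b. find the two clusters with the smallest average pairwise distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dists[np.ix_(clusters[i], clusters[j])].mean()
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # merge cluster j into cluster i, then c. repeat
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# 20 made-up points in two far-apart runs; 10% of 20 leaves 2 clusters
X = np.array([[float(i), 0.0] for i in range(10)] +
             [[100.0 + i, 0.0] for i in range(10)])
clusters = agglomerate(X)
print([sorted(c) for c in clusters])
```

The double loop makes every merge quadratic in the number of clusters, which is why the real experiment further down leaves the merging to scipy.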
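DBSCAN:
A density-based clustering algorithm. The number of clusters is determined automatically by the algorithm, and points in low-density regions are treated as noise and ignored, so it does not produce a complete clustering.
I tried it myself, and it did not seem to work well on the iris dataset; I will look into it again when I have time:
https://blog.csdn.net/weixin_43909872/article/details/85342540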
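To make "density-based" and "noise" concrete, here is a small from-scratch sketch. It is not sklearn's implementation and not the code from the linked post; I only borrowed the parameter names eps and min_samples from sklearn's DBSCAN, and the toy points are made up.

```python
import numpy as np

def dbscan(X, eps, min_samples):
    """Minimal DBSCAN sketch: one label per point, -1 means noise."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # neighbors within eps (each point counts as its own neighbor)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # grow a new cluster outward from this unlabeled core point
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels

# Two dense groups plus one isolated point
X = np.array([[0, 0], [0, 0.1], [0.1, 0],
              [5, 5], [5, 5.1], [5.1, 5],
              [20, 20]])
print(dbscan(X, eps=0.5, min_samples=3))  # [ 0  0  0  1  1  1 -1]
```

Nobody told the function how many clusters to find: two came out of the density settings, and the isolated point stayed at -1, which is the "incomplete clustering" mentioned above.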

Let's try them out with Python + sklearn + the iris dataset:

K-means

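import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Layout of iris.csv as implied by the indexing used here: column 0 is a row
# index, columns 1-4 are the four measurements, column 5 is the species name.
data = pd.read_csv("iris.csv")
features = data.iloc[:, 1:5].to_numpy(dtype=float)
y_pred = KMeans(n_clusters=3).fit(features)

# One color per predicted cluster. KMeans never returns the noise label -1
# (that only happens with DBSCAN), so no special case is needed.
colors = 'gbycm'
y_pred_color = [colors[pred] for pred in y_pred.labels_]

# Map the species names to 0/1/2 so they can be used as the y coordinate.
species_to_id = {'setosa': 0, 'versicolor': 1, 'virginica': 2}
category = [species_to_id[name] for name in data.iloc[:, 5]]

# x: row index, y: true species, color: cluster found by K-means
plt.scatter(data.iloc[:, 0], category, c=y_pred_color)
plt.show()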

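Result:
The data is split nicely into three clusters (green, blue, yellow). Some points of the second and third species land in a different cluster from their label in the iris dataset, but the output looks much better than what DBSCAN gave.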
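Instead of judging the agreement by eye, the clusters can be counted against the species labels. Below is a small sketch with made-up labels (these are not the real results of the run above); contingency and purity are helper names I invented for the example.

```python
import numpy as np

def contingency(true_labels, cluster_labels):
    """Rows are true classes, columns are clusters, cells are point counts."""
    classes, class_idx = np.unique(true_labels, return_inverse=True)
    clusters, cluster_idx = np.unique(cluster_labels, return_inverse=True)
    table = np.zeros((len(classes), len(clusters)), dtype=int)
    np.add.at(table, (class_idx, cluster_idx), 1)
    return classes, clusters, table

def purity(table):
    # fraction of points that belong to the majority class of their cluster
    return table.max(axis=0).sum() / table.sum()

# Hypothetical labels, for illustration only
true_labels = ['setosa'] * 3 + ['versicolor'] * 3 + ['virginica'] * 3
cluster_labels = [0, 0, 0, 1, 1, 2, 2, 2, 1]

classes, clusters, table = contingency(true_labels, cluster_labels)
print(classes)
print(table)
print(purity(table))
```

With the real data, y_pred.labels_ and the species column would take the place of the toy lists; a row with counts spread over two columns shows exactly which species got mixed up.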

Hierarchical agglomerative clustering (HCA):
a. Load the iris dataset and compute the Euclidean distance between every pair of rows:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster, maxinconsts

def euclideanDistance(trainingInstance, testInstance):
    # Sum the squared differences over every feature, then take the square root.
    distance = 0.0
    for x in range(len(trainingInstance)):
        distance += np.square(trainingInstance[x] - testInstance[x])
    return np.sqrt(distance)

def calculateDistances(data):
    # Build the full n x n matrix of pairwise distances
    # (symmetric, with zeros on the diagonal).
    distances = []
    for x in range(len(data)):
        distancesForRow = []
        for y in range(len(data)):
            dist = euclideanDistance(data[x, :], data[y, :])
            distancesForRow.append(dist)
        distances.append(distancesForRow)

    return distances

# Columns 1-4 hold the four measurements; column 5 holds the species name.
data = pd.read_csv("iris.csv").to_numpy()
np.random.shuffle(data)  # shuffle the rows in place
distances = calculateDistances(data[:, 1:5].astype(float))
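The same distance matrix can also be computed without Python loops, using NumPy broadcasting. A minimal sketch; pairwise_euclidean is a name made up for this example, and the three test points are chosen so the distances are easy to check by hand (3-4-5 triangles).

```python
import numpy as np

def pairwise_euclidean(X):
    """Return the n x n matrix of Euclidean distances between the rows of X."""
    # diff[i, j, :] is the difference between row i and row j
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(pairwise_euclidean(X))
# [[ 0.  5. 10.]
#  [ 5.  0.  5.]
#  [10.  5.  0.]]
```

For the 150 rows of iris the loop version is fast enough, so this is mostly a matter of taste.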

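b. Run the clustering
First keep merging all the way down to a single cluster. This uses the linkage function from scipy.cluster.hierarchy, which computes the merge made at each step.
Inspecting the result in a debugger shows one row per merge. The first row, for example, merged data points 0 and 3, and the resulting cluster contains 2 data points.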

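Then use the fcluster function to get the clustering we want. Here we use the maxclust criterion, i.e. keep merging until three clusters remain, so the result can be compared with the original iris labels:
fcluster(Z, maxCluster, criterion='maxclust')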
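A tiny example makes the linkage output easier to read. The five one-dimensional points below are made up, and here the raw points are passed to linkage directly instead of a distance matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five made-up points on a line, spaced so that no two merges tie
points = np.array([[0.0], [1.0], [5.0], [6.5], [20.0]])

# Each row of Z is one merge: [cluster i, cluster j, distance, new cluster size].
# Original points are numbered 0..n-1; merged clusters get the ids n, n+1, ...
Z = linkage(points, method='average')
print(Z)
# First row: points 0 and 1 merge at distance 1.0 into a cluster of size 2

# Cut the tree so that at most 3 clusters remain
labels = fcluster(Z, 3, criterion='maxclust')
print(labels)
```

Z always has n - 1 rows, one per merge, and fcluster returns one cluster number (starting from 1) per original point.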

from scipy.spatial.distance import squareform

def HCA(data, method='average', maxCluster = 5):
    '''HCA

    Arguments:
        data [[0, float, ...], [float, 0, ...]] -- square matrix of pairwise distances between data points

    Keyword Arguments:
        method {str} -- linkage method: single, complete, average, centroid, median, ward (default: {'average'})
        maxCluster {int} -- maximum number of clusters to keep (default: {5})
    Return:
        clusterNo int -- number of clusters actually produced
        indices [[idx1, idx2,..], [idx3]] -- the point indices in each cluster
    '''
    # linkage treats a 2-D array as raw observations, so convert the square
    # distance matrix to the condensed 1-D form it expects for distances.
    condensed = squareform(np.array(data, dtype=float))
    Z = linkage(condensed, method=method)
    # assignments = fcluster(Z, threshold, criterion='distance')
    assignments = fcluster(Z, maxCluster, criterion='maxclust')

    clusterNo = assignments.max()
    indices = getClasterIndices(assignments)
    return clusterNo, indices


def getClasterIndices(assignments):

    n = assignments.max()
    indices = []
    for cluster_number in range(1, n + 1):
        indices.append(np.where(assignments == cluster_number)[0])

    return indices
    
clustersNo, indices = HCA(distances, maxCluster = 3)

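c. Show the clustering result and compare it with the original labels in the iris dataset
The y coordinate is the cluster assigned by HCA, and the color is the species in the original iris dataset.
The result looks reasonable.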

# Color used for each species; anything unrecognized falls back to blue.
speciesColor = {'setosa': 'r', 'versicolor': 'g', 'virginica': 'y'}
for clusterId, indice in enumerate(indices, start=1):
    for nodeIndex in indice:
        # x: row index in the shuffled data, y: cluster assigned by HCA
        color = speciesColor.get(data[nodeIndex, 5], 'b')
        plt.scatter(nodeIndex, clusterId, c=color)

plt.show()
