Welcome ! This is Guanpx's blog.

阿新 • Published: 2019-01-15

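Unsupervised classification with the k-means algorithm

1. Clustering and k-means

Motivating example: in the 2004 US presidential election, Bush took 51.52% and Kerry 48.48%. In practice, with the right persuasion, a small fraction of voters would switch sides. So how do you find that small group, and how do you win them over on a limited budget? The answer is clustering (Machine Learning in Action, Chapter 10).
k-means is a clustering algorithm and a form of unsupervised learning.
A key concept in clustering is the "cluster": put simply, a cluster is a set of similar data points. k-means works mainly by computing distances between points and clusters, which is why it introduces the "centroid": a single point that represents a cluster (analogous to the centre of a circle). Because a cluster contains many points, the centroid is chosen as their mean: the average of all points in the cluster is its centroid. The data is partitioned into k clusters, each represented by a mean, and that is the intuition behind the name "k-means".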
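To make the idea of a centroid concrete, here is a minimal sketch that computes the centroid of one cluster as the column-wise mean of its points. The four points are made-up illustrative values, not data from this post.

```python
import numpy as np

# Four illustrative 2-D points forming one cluster (made-up values).
cluster = np.array([[1.0, 1.0],
                    [2.0, 1.0],
                    [1.0, 2.0],
                    [2.0, 2.0]])

# The centroid is the mean of the points, taken column by column.
centroid = cluster.mean(axis=0)
print(centroid)  # [1.5 1.5]
```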

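2. k-means pseudocode and idea

k-means is an algorithm that finds k clusters in a given data set; k is supplied by the user.
The main workflow, in pseudocode:

create k points as the initial centroids (usually chosen at random)
while the cluster assignment of any point changes
    for each data point in the data set
        for each centroid
            compute the distance from the centroid to the point
            keep the running minimum distance and record that centroid's id
        assign the point to the nearest cluster (the id recorded above)
    update the centroid of each cluster (the mean of all points in the cluster)
return the list of centroids and the assignment matrix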
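One pass of the loop above (assign every point, then update every centroid) can be sketched with vectorized NumPy as follows. The function name `kmeans_step` and the hand-picked starting centroids are assumptions made for illustration; this is not the implementation used later in the post, which loops explicitly and starts from random centroids.

```python
import numpy as np

def kmeans_step(points, centroids):
    """One assignment + update pass; both arguments are 2-D float arrays."""
    # Squared distance from every point to every centroid,
    # shape (num_points, num_centroids).
    diffs = points[:, None, :] - centroids[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Each point goes to the centroid with the smallest distance.
    labels = sq_dists.argmin(axis=1)
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = points[labels == j]
        if len(members) > 0:  # an empty cluster keeps its old centroid
            new_centroids[j] = members.mean(axis=0)
    return labels, new_centroids

# The four sample points from the test file at the end of the post.
points = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 5.0], [5.0, 6.0]])
# Hand-picked starting centroids (an assumption; the post uses random ones).
centroids = np.array([[0.0, 0.0], [6.0, 6.0]])

labels, centroids = kmeans_step(points, centroids)
print(labels)     # [0 0 1 1]
print(centroids)  # centroids are now (1.5, 1.0) and (4.5, 5.5)
```

Calling `kmeans_step` repeatedly until `labels` stops changing gives the full algorithm.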

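3. Bisecting k-means pseudocode and idea

Main idea
Tentatively split each cluster in two, then keep the split that gives the lowest total error.

Pseudocode
while the number of clusters is less than k
    for each cluster
        record the total error
        run k-means with k=2 on that cluster
        compute the total error after splitting it in two
    split the cluster whose split gives the lowest total error
return the clusters and the assignments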
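The "total error" in the pseudocode is the sum of squared errors (SSE): the sum of squared distances from each point to its cluster's centroid. The sketch below compares the SSE of one cluster before and after a split; the helper name `sse` and the choice of split are assumptions made for illustration.

```python
import numpy as np

def sse(points):
    """Sum of squared distances from each point to the mean of the points."""
    centroid = points.mean(axis=0)
    return float(((points - centroid) ** 2).sum())

# The four sample points from the test file at the end of the post.
points = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 5.0], [5.0, 6.0]])

sse_before = sse(points)                       # all points in one cluster
sse_after = sse(points[:2]) + sse(points[2:])  # split into two clusters

print(sse_before)  # 30.75
print(sse_after)   # 1.5
```

Bisecting k-means evaluates a split like this for every existing cluster and commits the one that leaves the lowest total SSE.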

[Figure: k-means producing an incorrect clustering]
[Figure: k-means producing a correct clustering]
[Figure: bisecting k-means, no failure]

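# -*- coding:utf-8 -*-

from numpy import *
import pylab as pl


# Read 2-D coordinates from a whitespace-separated text file into a list of rows
def loadDataSet(fileName):
    dataMat = []
    with open(fileName, 'r') as txtFile:  # text mode is enough for a plain-text file
        for line in txtFile:
            if not line.strip():  # skip blank lines
                continue
            # list(...) is needed on Python 3, where map() returns an iterator
            dataMat.append(list(map(float, line.split())))
    return dataMat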


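# Euclidean distance between two vectors (straight-line distance between two points)
# Note: sum, min and max below are NumPy's versions, because of "from numpy import *"
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))


# Generate k random centroids inside the bounding box of the data
def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))  # k x n matrix of centroids
    for j in range(n):
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = minJ + rangeJ * random.rand(k, 1)  # k x 1 column of random numbers in [0, 1)
    return centroids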


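def Kmeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]  # m rows of data, i.e. m points
    clusterAssment = mat(zeros((m, 2)))  # per point: [cluster index, squared distance to its centroid]
    centroids = createCent(dataSet, k)  # matrix of cluster centroids
    clusterChanged = True  # loop until no assignment changes
    while clusterChanged:
        clusterChanged = False
        for i in range(m):  # find the nearest centroid for each point
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])  # distance between point i and centroid j
                if distJI < minDist:  # keep the running minimum and record its index (the cluster id)
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:  # the nearest cluster is not the point's current cluster
                clusterChanged = True  # keep iterating
            clusterAssment[i, :] = minIndex, minDist ** 2  # store the cluster id and the squared distance
        # print(centroids)
        for cent in range(k):  # update the centroids; nonzero returns the indices of the nonzero entries
            nowInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]  # all points in this cluster
            if len(nowInClust) > 0:  # an empty cluster keeps its old centroid instead of becoming NaN
                centroids[cent, :] = mean(nowInClust, axis=0)  # column-wise mean gives the new centroid
    return centroids, clusterAssment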


# Plot the clusters ('+') and the centroids ('o') with pylab
def printPic(inMat, Assment, centroids, k):
    for cent in range(k):
        pl.plot(inMat[Assment[:, 0] == cent, 0], inMat[Assment[:, 0] == cent, 1], '+')
    pl.plot(centroids[:, 0], centroids[:, 1], "o")
    pl.show()

# Bisecting k-means (built on Kmeans); dataSet is expected to be a 2-D ndarray
def binKmeans(dataSet, k, distMeas=distEclud):
    m = shape(dataSet)[0]
    clusterAssment = mat(zeros((m, 2)))  # per point: [cluster index, squared distance to its centroid]
    centroid0 = mean(dataSet, axis=0).tolist()  # initial centroid: mean of all points
    centList = [centroid0]
    # squared distance from every point to the initial centroid
    for j in range(m):
        clusterAssment[j, 1] = distMeas(mat(centroid0), dataSet[j, :]) ** 2
    # keep bisecting until there are k centroids
    while (len(centList) < k):
        lowestSSE = inf  # reset to infinity on every pass
        for i in range(len(centList)):
            # points currently assigned to cluster i
            tempCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]
            # split cluster i in two
            tempCentroidMat, tempSplitClusterAssment = Kmeans(tempCluster, 2, distMeas)
            # SSE of the points in the split cluster
            sseSplit = sum(tempSplitClusterAssment[:, 1])
            # SSE of all the other points
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            if (sseNotSplit + sseSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = tempCentroidMat
                bestClustAss = tempSplitClusterAssment.copy()
                lowestSSE = sseSplit + sseNotSplit

        # update the assignments
        # sub-cluster 1 gets a new index: with clusters 0 1 2 it becomes cluster 3
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        # sub-cluster 0 keeps the index of the cluster that was split
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        # update the centroids
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
        centList.append(bestNewCents[1, :].tolist()[0])  # append the second new centroid
        # update the assignments and squared distances
        # the rows of the split cluster are replaced by the rows of its two sub-clusters
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss
    return mat(centList), clusterAssment


if __name__ == "__main__":
    data = loadDataSet("testByGpx.txt")
    centroids, assment = binKmeans(array(data), 2)
    printPic(array(data), array(assment), array(centroids), 2)

'testByGpx: simple test data is shown below'
'''
1 1
2 1
4 5
5 6
'''
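To try the listing, the four sample rows need to exist as a text file. The sketch below writes them to a temporary directory and reads them back with `numpy.loadtxt`, which could stand in for `loadDataSet` on a whitespace-separated file. Using a temporary directory is an assumption made to keep the example self-contained; the post itself reads `testByGpx.txt` from the working directory.

```python
import os
import tempfile

import numpy as np

# The four sample points listed above, one "x y" pair per line.
sample = "1 1\n2 1\n4 5\n5 6\n"

# Write the file into a temporary directory (an assumption for this sketch).
path = os.path.join(tempfile.mkdtemp(), "testByGpx.txt")
with open(path, "w") as f:
    f.write(sample)

# loadtxt splits on whitespace by default and returns a float array.
data = np.loadtxt(path)
print(data.shape)  # (4, 2)
```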
