4. K-Means and K-Means++ Implementation
1. K-Means Principles Explained
2. K-Means Optimizations
3. Using sklearn's K-Means
4. K-Means and K-Means++ Implementation
1. Introduction
The previous three K-Means posts covered the algorithm in detail from the angles of principle, optimization, and usage. In this post I walk readers through a Python implementation of K-Means. The code was adapted and finished by me; although it is far less efficient than sklearn's, its purpose is to help you understand the K-Means algorithm from the code itself. Going forward I will, to the best of my ability, gradually implement all of the machine learning algorithms in the same way.
2. Basic K-Means Framework
Let's first implement a basic k-means. The code is below; readers who want the complete code can head over to my github:
import numpy as np
from collections import defaultdict
from sklearn.utils import check_random_state

# Helper assumed by the code below (the complete code on github defines it):
# squared Euclidean distance from point(s) `a` to each row of `b`.
def distance(a, b):
    return np.sum(np.square(a - b), axis=1)

class KMeansBase(object):

    def __init__(self, n_clusters=8, init="random", max_iter=300, random_state=None, n_init=10, tol=1e-4):
        self.k = n_clusters        # number of clusters
        self.init = init           # initialization method
        self.max_iter = max_iter   # maximum number of iterations
        self.random_state = check_random_state(random_state)  # random number generator
        self.n_init = n_init       # run the clustering several times, keep the best run
        self.tol = tol             # threshold for stopping the iterations

    # fit builds the model on the training data
    def fit(self, dataset):
        self.tol = self._tolerance(dataset, self.tol)
        bestError = None
        bestCenters = None
        bestLabels = None
        for i in range(self.n_init):
            labels, centers, error = self._kmeans(dataset)
            if bestError is None or error < bestError:
                bestError = error
                bestCenters = centers
                bestLabels = labels
        self.centers = bestCenters
        return bestLabels, bestCenters, bestError

    # predict assigns new data using the trained model
    def predict(self, X):
        return self.update_labels_error(X, self.centers)[0]

    # fit and predict combined
    def fit_predict(self, dataset):
        self.fit(dataset)
        return self.predict(dataset)

    # the core of kmeans: one complete clustering run
    def _kmeans(self, dataset):
        self.dataset = np.array(dataset)
        bestError = None
        bestCenters = None
        bestLabels = None
        centerShiftTotal = 0
        centers = self._init_centroids(dataset)
        for i in range(self.max_iter):
            oldCenters = centers.copy()
            labels, error = self.update_labels_error(dataset, centers)
            centers = self.update_centers(dataset, labels)
            if bestError is None or error < bestError:
                bestLabels = labels.copy()
                bestCenters = centers.copy()
                bestError = error
            # how far the centers moved between oldCenters and centers
            centerShiftTotal = np.linalg.norm(oldCenters - centers) ** 2
            if centerShiftTotal <= self.tol:
                break
        # the last pass of the loop updated centers, so if they differ from the
        # old centers, recompute labels and error once more
        if centerShiftTotal > 0:
            bestLabels, bestError = self.update_labels_error(dataset, bestCenters)
        return bestLabels, bestCenters, bestError

    # pick k data points at random as the initial centers
    def _init_centroids(self, dataset):
        n_samples = dataset.shape[0]
        centers = []
        if self.init == "random":
            seeds = self.random_state.permutation(n_samples)[:self.k]
            centers = dataset[seeds]
        elif self.init == "k-means++":
            pass
        return np.array(centers)

    # scale tol to the dataset
    def _tolerance(self, dataset, tol):
        variances = np.var(dataset, axis=0)
        return np.mean(variances) * tol

    # update each point's label and compute the error
    def update_labels_error(self, dataset, centers):
        labels = self.assign_points(dataset, centers)
        new_means = defaultdict(list)
        error = 0
        for assignment, point in zip(labels, dataset):
            new_means[assignment].append(point)
        for points in new_means.values():
            newCenter = np.mean(points, axis=0)
            error += np.sqrt(np.sum(np.square(points - newCenter)))
        return labels, error

    # update the cluster centers
    def update_centers(self, dataset, labels):
        new_means = defaultdict(list)
        centers = []
        for assignment, point in zip(labels, dataset):
            new_means[assignment].append(point)
        for points in new_means.values():
            newCenter = np.mean(points, axis=0)
            centers.append(newCenter)
        return np.array(centers)

    # assign each point to its nearest center
    def assign_points(self, dataset, centers):
        labels = []
        for point in dataset:
            shortest = float("inf")  # positive infinity
            shortest_index = 0
            for i in range(len(centers)):
                val = distance(point[np.newaxis], centers[i])
                if val < shortest:
                    shortest = val
                    shortest_index = i
            labels.append(shortest_index)
        return labels
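To sanity-check the class, here is a quick usage sketch; the blob data and the three-cluster setup are my own illustrative choices, not from the original post:

from sklearn.datasets import make_blobs

# Toy data: 3 well-separated Gaussian blobs.
X, y = make_blobs(n_samples=300, centers=3, random_state=42)

km = KMeansBase(n_clusters=3, init="random", random_state=42)
labels, centers, error = km.fit(X)
print("within-cluster error:", error)
print("first 10 labels:", labels[:10])

# predict() assigns new points to the learned centers.
print(km.predict(np.array([[0.0, 0.0]])))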
The above is my implementation of the basic K-Means procedure built on EM-style alternating updates. I tried to model the interface and parameter design on sklearn's, so readers already familiar with sklearn should find it easy to pick up.
3. K-Means++ Implementation
The principle behind k-means++ was introduced earlier in this series; to go along with the code, here is a recap.
- Randomly choose one point from the input dataset as the first cluster center \(\mu_1\).
- For each point \(x_i\) in the dataset, compute the distance to the nearest of the already-chosen cluster centers: \[ D(x_i) = \min_{r=1,2,\dots,k_{selected}} \|x_i-\mu_r\|^2 \]
- Choose a new data point as the next cluster center, where the selection rule is: the larger \(D(x)\), the higher the probability that the point is picked (a toy sketch of this weighted draw follows the list).
- Repeat steps 2 and 3 until k cluster centers have been chosen.
- Run standard K-Means using these k points as the initial centroids.
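The weighted draw in step 3 does not need explicit probabilities: scale a uniform random number onto [0, sum(D)) and binary-search it in the running sum of D. A minimal sketch, with made-up values for D:

rng = np.random.RandomState(0)

# D(x) for 5 samples (made-up values): larger D => more likely to be drawn.
d = np.array([0.1, 0.4, 0.1, 3.0, 0.4])

# Running sum: [0.1, 0.5, 0.6, 3.6, 4.0].
cum = np.cumsum(d)

# A uniform draw on [0, 4.0) lands in sample i's "slice" of the running
# sum with probability d[i] / d.sum(); searchsorted finds that slice.
r = rng.random_sample() * d.sum()
print(np.searchsorted(cum, r))  # most often 3, the sample with the largest D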
The key parts are steps 2 and 3; here is the concrete implementation:
# k-means++ initialization: pick spread-out starting centers to speed up clustering
def _k_means_plus_plus(self, dataset):
    n_samples, n_features = dataset.shape
    centers = np.empty((self.k, n_features))
    # n_local_trials is the number of candidate points sampled in each round
    n_local_trials = 2 + int(np.log(self.k))

    # the first center is picked uniformly at random
    center_id = self.random_state.randint(n_samples)
    centers[0] = dataset[center_id]

    # closest_dist_sq[i] is the distance from sample i to its nearest chosen center.
    # With, say, 3 centers chosen: closest_dist_sq = [min(dist of sample 1 to the
    # 3 centers), min(dist of sample 2 to the 3 centers), ..., min(dist of sample n
    # to the 3 centers)]
    closest_dist_sq = distance(centers[0, np.newaxis], dataset)
    # current_pot is the sum of all shortest distances (the "potential")
    current_pot = closest_dist_sq.sum()

    for c in range(1, self.k):
        # draw n_local_trials random values and scale them onto [0, current_pot)
        rand_vals = self.random_state.random_sample(n_local_trials) * current_pot
        # np.cumsum([1,2,3,4]) = [1, 3, 6, 10]: the running sum up to each index.
        # np.searchsorted locates where each rand_val falls in np.cumsum(closest_dist_sq);
        # candidate_ids are the indices of the candidate points
        candidate_ids = np.searchsorted(np.cumsum(closest_dist_sq), rand_vals)
        # guard against a rand_val landing exactly on the total sum (floating point)
        candidate_ids = np.minimum(candidate_ids, n_samples - 1)

        # best_candidate: index of the best candidate
        # best_pot: the potential (sum of shortest distances) it yields
        # best_dist_sq: its shortest-distance list
        best_candidate = None
        best_pot = None
        best_dist_sq = None
        for trial in range(n_local_trials):
            # distance from every sample to this candidate
            distance_to_candidate = distance(dataset[candidate_ids[trial], np.newaxis], dataset)
            # new_dist_sq: the updated shortest-distance list; new_pot: its sum
            new_dist_sq = np.minimum(closest_dist_sq, distance_to_candidate)
            new_pot = new_dist_sq.sum()
            # keep the candidate with the smallest new_pot
            if (best_candidate is None) or (new_pot < best_pot):
                best_candidate = candidate_ids[trial]
                best_pot = new_pot
                best_dist_sq = new_dist_sq
        centers[c] = dataset[best_candidate]
        current_pot = best_pot
        closest_dist_sq = best_dist_sq
    return centers
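_k_means_plus_plus is meant to live inside KMeansBase; to switch it on, the "k-means++" branch of _init_centroids (left as pass above) just needs to call it. A minimal sketch of that wiring (the complete code on github may organize it differently):

# inside KMeansBase, replacing the earlier _init_centroids
def _init_centroids(self, dataset):
    n_samples = dataset.shape[0]
    if self.init == "random":
        seeds = self.random_state.permutation(n_samples)[:self.k]
        return dataset[seeds]
    elif self.init == "k-means++":
        return self._k_means_plus_plus(dataset)

# usage: same interface as before, only the init parameter changes
km = KMeansBase(n_clusters=3, init="k-means++", random_state=42)
labels, centers, error = km.fit(X)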
4. Comparing the Results
Using kmeans_base, kmeans++, and sklearn's KMeans, cluster the datasets that ship with sklearn (iris, the boston housing prices, and digits) and compare the speed and clustering quality of the three.
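The post does not include the benchmark code itself; below is a hedged sketch of how such a comparison could look. It assumes the k-means++ wiring from the previous section; load_boston has been removed from recent scikit-learn releases, so the sketch sticks to iris and digits, and the timing plus adjusted Rand index are my choice of metrics, not the author's:

import time
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris, load_digits
from sklearn.metrics import adjusted_rand_score

for load in (load_iris, load_digits):
    data = load()
    X, y = data.data, data.target
    k = len(np.unique(y))

    # n_init kept small for the pure-Python models, whose loops are slow
    for name, model in [
        ("kmeans_base", KMeansBase(n_clusters=k, init="random", n_init=3, random_state=0)),
        ("kmeans++", KMeansBase(n_clusters=k, init="k-means++", n_init=3, random_state=0)),
        ("sklearn", KMeans(n_clusters=k, n_init=3, random_state=0)),
    ]:
        start = time.time()
        labels = model.fit_predict(X)
        elapsed = time.time() - start
        print("%s %s: %.3fs, ARI=%.3f"
              % (load.__name__, name, elapsed, adjusted_rand_score(y, labels)))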
5. Summary
This concludes the K-Means series. Interested readers can go on and try the other two optimizations mentioned in the optimization post: elkan, which reduces the number of distance computations, and Mini Batch, which speeds up computation on large samples.
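For the Mini Batch variant, sklearn already ships an implementation; a quick usage sketch (the dataset size and parameters are illustrative):

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# A larger toy dataset, where mini-batch updates pay off.
X, _ = make_blobs(n_samples=100000, centers=8, random_state=0)

# Each iteration updates the centers from a random batch of 1024 samples
# instead of the full dataset, trading a little accuracy for speed.
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, random_state=0)
labels = mbk.fit_predict(X)
print(mbk.inertia_)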