k-means Clustering of Movie

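A Xin • Published: 2018-12-20

k-means Clustering of Movie Ratings

Say you're a data analyst at Netflix and you want to explore the similarities and differences in people's tastes in movies, based on how they rate different movies. Can understanding these ratings contribute to a movie recommendation system for users? Let's dig into the data and see.

The data we'll be using comes from the wonderful MovieLens user rating dataset. We'll look at individual movie ratings later in the notebook, but let's start with how ratings of different genres compare to each other.

Dataset overview

The dataset has two files. We'll import both of them into pandas dataframes: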

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.sparse import csr_matrix
import helper

# Import the Movies dataset
movies = pd.read_csv('ml-latest-small/movies.csv')
movies.head()

movieId	title	genres
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
4	5	Father of the Bride Part II (1995)	Comedy

# Import the ratings dataset
ratings = pd.read_csv('ml-latest-small/ratings.csv')
ratings.head()

userId	movieId	rating	timestamp
0	1	31	2.5	1260759144
1	1	1029	3.0	1260759179
2	1	1061	3.0	1260759182
3	1	1129	2.0	1260759185
4	1	1172	4.0	1260759205

Now that we know the structure of the dataset, let's see how many records each table contains.

print('The dataset contains: ', len(ratings), ' ratings of ', len(movies), ' movies.')

The dataset contains:  100004  ratings of  9125  movies.

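Romance vs. Sci-Fi

Let's start by taking a small subset of users and seeing what genres they prefer. We've hidden most of the data preprocessing in helper functions so that we can focus on the clustering itself. After finishing this notebook, it is worth skimming helper.py to see how these helper functions are implemented.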

# Calculate the average rating of romance and scifi movies

genre_ratings = helper.get_genre_ratings(ratings, movies, ['Romance', 'Sci-Fi'], ['avg_romance_rating', 'avg_scifi_rating'])
genre_ratings.head()

avg_romance_rating	avg_scifi_rating
userId
1	3.50	2.40
2	3.59	3.80
3	3.65	3.14
4	4.50	4.26
5	4.08	4.00

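The function get_genre_ratings calculated each user's average rating of all romance movies and all sci-fi movies. Let's bias the dataset a little by removing people who like both sci-fi and romance, so that the clusters tend to define users as preferring one genre over the other.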

biased_dataset = helper.bias_genre_rating_dataset(genre_ratings, 3.2, 2.5)

print( "Number of records: ", len(biased_dataset))
biased_dataset.head()

Number of records:  183

userId	avg_romance_rating	avg_scifi_rating
0	1	3.50	2.40
1	3	3.65	3.14
2	6	2.90	2.75
3	7	2.93	3.36
4	12	2.89	2.62

We can see that we have 183 users, and for each user we have their average rating of the romance and sci-fi movies they've watched.
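helper.py is not shown in this post, so the following is only a rough sketch of what the genre averaging and the biasing step could look like. It assumes the same column names as movies.csv and ratings.csv. The function names genre_average_ratings and keep_one_sided_fans, the toy data, and the simplified filter rule are all assumptions made for illustration, not the actual contents of helper.py.

```python
import pandas as pd

def genre_average_ratings(ratings, movies, genres, column_names):
    """Average rating per user for each genre (hypothetical stand-in for the helper)."""
    columns = []
    for genre in genres:
        # regex=False so that 'Sci-Fi' is matched literally inside 'Action|Sci-Fi'
        in_genre = movies[movies['genres'].str.contains(genre, regex=False)]
        genre_rows = ratings[ratings['movieId'].isin(in_genre['movieId'])]
        columns.append(genre_rows.groupby('userId')['rating'].mean().round(2))
    result = pd.concat(columns, axis=1)
    result.columns = column_names
    return result

def keep_one_sided_fans(genre_ratings, high, low):
    """Keep users who rate one genre above `high` and the other below `low` (assumed rule)."""
    romance = genre_ratings['avg_romance_rating']
    scifi = genre_ratings['avg_scifi_rating']
    one_sided = ((romance > high) & (scifi < low)) | ((scifi > high) & (romance < low))
    return genre_ratings[one_sided]

# Toy data with the same columns as the MovieLens files
toy_movies = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['A (2000)', 'B (2001)', 'C (2002)'],
    'genres': ['Romance', 'Sci-Fi', 'Romance|Sci-Fi'],
})
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3, 3],
    'movieId': [1, 2, 1, 2, 1, 2],
    'rating':  [5.0, 1.0, 2.0, 4.5, 4.0, 4.0],
})

toy_genre_ratings = genre_average_ratings(
    toy_ratings, toy_movies, ['Romance', 'Sci-Fi'],
    ['avg_romance_rating', 'avg_scifi_rating'])
toy_biased = keep_one_sided_fans(toy_genre_ratings, high=3.2, low=2.5)
print(toy_biased)
```

In this toy data, user 3 rates both genres at 4.0 and is dropped, while users 1 and 2 each clearly prefer one genre and are kept.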

Let's plot this dataset:

%matplotlib inline

helper.draw_scatterplot(biased_dataset['avg_scifi_rating'],'Avg scifi rating', biased_dataset['avg_romance_rating'], 'Avg romance rating')

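[Figure: scatter plot of average sci-fi rating against average romance rating]

We can see a clear bias in this sample (one that we created on purpose). How would it look if we split the sample into two groups using k-means?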

# Let's turn our dataset into a list
X = biased_dataset[['avg_scifi_rating','avg_romance_rating']].values

Prepare KMeans with n_clusters = 2
Pass the dataset X to KMeans' fit_predict method and store the cluster labels in predictions

# TODO: Import KMeans
from sklearn.cluster import KMeans

# TODO: Create an instance of KMeans to find two clusters
kmeans_1 = KMeans(n_clusters = 2)

# TODO: use fit_predict to cluster the dataset
predictions = kmeans_1.fit_predict(X)

# Plot
helper.draw_clusters(biased_dataset, predictions)

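[Figure: the scatter plot colored by the two clusters]

We can see that the groups are mostly based on how each person rated romance movies. If a user's average rating of romance movies is over 3 stars, they belong to one group; otherwise, they belong to the other.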
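To make it clearer what fit_predict is doing, here is a minimal sketch of the standard k-means iteration (Lloyd's algorithm) in plain NumPy. It is an illustration only: the function name simple_kmeans and the toy data are made up, and scikit-learn's KMeans adds smarter initialization and several restarts on top of this basic loop.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy (avg_scifi_rating, avg_romance_rating) pairs: two obvious groups
toy_X = np.array([[2.0, 4.5], [2.2, 4.0], [1.8, 4.2],
                  [4.5, 2.0], [4.2, 2.4], [4.0, 1.9]])
toy_labels, toy_centroids = simple_kmeans(toy_X, k=2)
print(toy_labels)
```

The first three points end up in one cluster and the last three in the other. Which cluster is numbered 0 depends on the random starting points.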

What would happen if we split them into three groups?


# TODO: Create an instance of KMeans to find three clusters
kmeans_2 = KMeans(n_clusters = 3)

# TODO: use fit_predict to cluster the dataset
predictions_2 = kmeans_2.fit_predict(X)

# Plot
helper.draw_clusters(biased_dataset, predictions_2)

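[Figure: the scatter plot colored by the three clusters]

Now the average sci-fi rating starts to come into play. The groups are:

people who like romance but not sci-fi
people who like sci-fi but not romance
people who like both sci-fi and romance

Let's add one more group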

# TODO: Create an instance of KMeans to find four clusters
kmeans_3 = KMeans(n_clusters = 4)

# TODO: use fit_predict to cluster the dataset
predictions_3 = kmeans_3.fit_predict(X)

# Plot
helper.draw_clusters(biased_dataset, predictions_3)

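[Figure: the scatter plot colored by the four clusters]

We can see that the more clusters we break the dataset into, the more similar the tastes of the users within each cluster are to one another.

Choosing K

We can split the data points into any number of clusters. What is the right number of clusters for this dataset?

There are several ways of choosing the number of clusters, k. We'll look at a simple one called "the elbow method". The elbow method works by plotting ascending values of k against the total error calculated using that k.

How do we calculate the total error? One way is to use the squared error. Say we're calculating the error for k=2. We'd have two clusters, each with one "centroid" point. For each point in the dataset, we subtract the centroid of its cluster from the point's coordinates. We then square the differences (to get rid of negative values) and sum them up. This gives us an error value for each point. If we sum these error values, we get the total error for all points when k=2.

One task now is to do the same for each k (between 1 and the number of elements in the dataset). Note that the code below starts at k=2 and, as its comments and axis label indicate, records a silhouette score for each k (higher is better) rather than the squared error.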
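The squared-error calculation described above can be sketched in a few lines of NumPy. The function name total_squared_error and the toy points are assumptions made for illustration. The labels and centroids are written out by hand here so that the arithmetic is easy to follow.

```python
import numpy as np

def total_squared_error(X, labels, centroids):
    """Sum over all points of the squared distance to the assigned centroid."""
    differences = X - centroids[labels]          # coordinates minus own centroid
    return float((differences ** 2).sum())       # square, then add everything up

toy_X = np.array([[1.0, 1.0], [1.0, 3.0], [5.0, 1.0], [5.0, 3.0]])

# k=1: a single cluster whose centroid is the overall mean (3, 2)
labels_k1 = np.zeros(4, dtype=int)
centroids_k1 = toy_X.mean(axis=0, keepdims=True)
error_k1 = total_squared_error(toy_X, labels_k1, centroids_k1)

# k=2: left pair and right pair, with centroids (1, 2) and (5, 2)
labels_k2 = np.array([0, 0, 1, 1])
centroids_k2 = np.array([[1.0, 2.0], [5.0, 2.0]])
error_k2 = total_squared_error(toy_X, labels_k2, centroids_k2)

print(error_k1, error_k2)
```

With one cluster every point is off by (±2, ±1), so the total is 4 × (4 + 1) = 20. With two clusters every point is off by (0, ±1), so the total drops to 4. The error keeps shrinking as k grows, which is why the elbow method looks for the point where the improvement levels off instead of simply taking the minimum.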

# Choose the range of k values to test.
# We added a stride of 5 to improve performance. We don't need to calculate the error for every k value
possible_k_values = range(2, len(X)+1, 5)

# Calculate error values for all k values we're interested in
errors_per_k = [helper.clustering_errors(k, X) for k in possible_k_values]

# Optional: Look at the values of K vs the silhouette score of running K-means with that value of k
list(zip(possible_k_values, errors_per_k))

[(2, 0.35588178764728251),
 (7, 0.37324118163771741),
 (12, 0.35650856326047475),
 (17, 0.3741137698024623),
 (22, 0.37718217339438476),
 (27, 0.36071909992215945),
 (32, 0.37104279808464452),
 (37, 0.3649882241766923),
 (42, 0.36895091450195883),
 (47, 0.37696003940733186),
 (52, 0.38716548900081571),
 (57, 0.35079775582937778),
 (62, 0.34916584233387205),
 (67, 0.34839937724907),
 (72, 0.34907390154971468),
 (77, 0.34837739216196456),
 (82, 0.3309353056966266),
 (87, 0.34005916910201761),
 (92, 0.32494553685658306),
 (97, 0.32418331059507227),
 (102, 0.31329160485165003),
 (107, 0.29407239955320186),
 (112, 0.27366896911138017),
 (117, 0.28906341363336779),
 (122, 0.27342563040040624),
 (127, 0.25219179857975438),
 (132, 0.25320773897416415),
 (137, 0.2412264569953621),
 (142, 0.21855949198498667),
 (147, 0.19924498428850082),
 (152, 0.18722856283659275),
 (157, 0.16447514022082693),
 (162, 0.14697529680439808),
 (167, 0.12609539969216882),
 (172, 0.096865005870864829),
 (177, 0.064230120163174503),
 (182, 0.054644808743169397)]

# Plot each value of K vs. the silhouette score at that value
fig, ax = plt.subplots(figsize=(16, 6))
ax.set_xlabel('K - number of clusters')
ax.set_ylabel('Silhouette Score (higher is better)')
ax.plot(possible_k_values, errors_per_k)

# Ticks and grid
xticks = np.arange(min(possible_k_values), max(possible_k_values)+1, 5.0)
ax.set_xticks(xticks, minor=False)
ax.set_xticks(xticks, minor=True)
ax.xaxis.grid(True, which='both')
yticks = np.arange(round(min(errors_per_k), 2), max(errors_per_k), .05)
ax.set_yticks(yticks, minor=False)
ax.set_yticks(yticks, minor=True)
ax.yaxis.grid(True, which='both')

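[Figure: silhouette score plotted against K, the number of clusters]

Looking at this graph, good choices for k include 7, 22, 27, and 32, among other values (which vary slightly between runs). Increasing the number of clusters (k) beyond that range starts to result in worse clusters (according to the silhouette score).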
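For readers unfamiliar with the silhouette score, here is a small NumPy sketch of how it is commonly defined: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the point's score is (b - a) / max(a, b). The function name mean_silhouette and the toy data are illustrative assumptions. In practice you would normally call sklearn.metrics.silhouette_score.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    cluster_ids = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue                     # singleton cluster: score is defined as 0
        # a: mean distance to the other members of the point's own cluster
        a = distances[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(distances[i, labels == c].mean()
                for c in cluster_ids if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good_labels = np.array([0, 0, 1, 1])   # matches the two visible groups
bad_labels = np.array([0, 1, 0, 1])    # mixes the groups

good_score = mean_silhouette(toy_X, good_labels)
bad_score = mean_silhouette(toy_X, bad_labels)
print(good_score, bad_score)
```

The labelling that matches the two groups scores close to 1, while the mixed labelling scores below 0. This is the sense in which a higher silhouette score means better clusters.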

My pick would be k=7 because it's easier to visualize:

# TODO: Create an instance of KMeans to find seven clusters
kmeans_4 = KMeans(n_clusters=7)

# TODO: use fit_predict to cluster the dataset
predictions_4 = kmeans_4.fit_predict(X)

# plot
helper.draw_clusters(biased_dataset, predictions_4, cmap='Accent')

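[Figure: the scatter plot colored by the seven clusters]

Note: as you try to plot larger values of k (more than 10), you'll have to make sure your plotting library is not reusing colors between clusters. For this plot, we had to use the matplotlib colormap 'Accent' because other colormaps either did not show enough contrast between colors or recycled colors past 8 or 10 clusters.

Throwing some Action into the mix

So far, we've only looked at how users rated romance and sci-fi movies. Let's add another genre and see how things look once action movies are included.

Our dataset now looks like this: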

biased_dataset_3_genres = helper.get_genre_ratings(ratings, movies, 
                                                     ['Romance', 'Sci-Fi', 'Action'], 
                                                     ['avg_romance_rating', 'avg_scifi_rating', 'avg_action_rating'])
biased_dataset_3_genres = helper.bias_genre_rating_dataset(biased_dataset_3_genres, 3.2, 2.5).dropna()

print( "Number of records: ", len(biased_dataset_3_genres))
biased_dataset_3_genres.head()

Number of records:  183

userId	avg_romance_rating	avg_scifi_rating	avg_action_rating
0	1	3.50	2.40	2.80
1	3	3.65	3.14	3.47
2	6	2.90	2.75	3.27
3	7	2.93	3.36	3.29
4	12	2.89	2.62	3.21

X_with_action = biased_dataset_3_genres[['avg_scifi_rating',
                                         'avg_romance_rating', 
                                         'avg_action_rating']].values

# TODO: Create an instance of KMeans to find seven clusters
kmeans_5 = KMeans(n_clusters=7)

# TODO: use fit_predict to cluster the dataset
predictions_5 = kmeans_5.fit_predict(X_with_action)

# plot
helper.draw_clusters_3d(biased_dataset_3_genres, predictions_5)

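[Figure: sci-fi vs. romance scatter plot colored by cluster, with point size indicating the action rating]

We're still using the x and y axes for sci-fi and romance respectively. The size of each dot roughly represents the action rating (larger dots for average ratings over 3 stars, smaller dots for ratings of 3 stars or less).

We can see that adding the genre changed how the users are clustered. The more data we give k-means, the more similar the tastes of the users in each group become. Unfortunately, if we keep plotting this way, we lose the ability to visualize anything beyond two or three dimensions. In the next section, we'll use a different kind of plot to look at clusters with up to 50 dimensions.

Movie-level Clustering

Now that we've seen how k-means clusters users based on their genre tastes, let's go a step further and look at how users rated individual movies. To do that, we'll shape the dataset as userId against each user's rating of each movie. For example, let's look at a subset of the dataset: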

# Merge the two tables then pivot so we have Users X Movies dataframe
ratings_title = pd.merge(ratings, movies[['movieId', 'title']], on='movieId' )
user_movie_ratings = pd.pivot_table(ratings_title, index='userId', columns= 'title', values='rating')

print('dataset dimensions: ', user_movie_ratings.shape, '\n\nSubset example:')
user_movie_ratings.iloc[:6, :10]

dataset dimensions:  (671, 9064) 

Subset example:

title	"Great Performances" Cats (1998)	$9.99 (2008)	'Hellboy': The Seeds of Creation (2004)	'Neath the Arizona Skies (1934)	'Round Midnight (1986)	'Salem's Lot (2004)	'Til There Was You (1997)	'burbs, The (1989)	'night Mother (1986)	(500) Days of Summer (2009)
userId
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
6	NaN	NaN	NaN	NaN	NaN	NaN	NaN	4.0	NaN	NaN

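The dominance of NaN values presents the first issue. Most users have not watched and rated most movies. Datasets like this are called "sparse" because only a small number of cells have a value.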
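To put a number on "sparse", the following sketch builds a tiny user-by-movie table with the same pivot_table call and measures the fraction of cells that hold a rating. The toy data and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy ratings in the same long format as ratings_title
toy_ratings_title = pd.DataFrame({
    'userId': [1, 1, 2, 3],
    'title':  ['Movie A', 'Movie B', 'Movie B', 'Movie C'],
    'rating': [4.0, 3.5, 5.0, 2.0],
})

# One row per user, one column per movie; unrated pairs become NaN
toy_user_movie = pd.pivot_table(toy_ratings_title, index='userId',
                                columns='title', values='rating')

# Fraction of cells that actually hold a rating
filled_cells = toy_user_movie.notna().sum().sum()
density = filled_cells / toy_user_movie.size
print(toy_user_movie)
print('density:', round(density, 3))
```

Here 4 of the 9 cells are filled. Applying the same arithmetic to the real table, about 100,000 ratings spread over 671 × 9064 cells means that only roughly 1.6% of the cells hold a value.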

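To get around this, let's sort by the most-rated movies and by the users who have rated the most movies. That gives us a "denser" region to look at near the top of the dataset.

If we choose the most-rated movies and the users who rated the most movies, it looks like this: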

n_movies = 30
n_users = 18
most_rated_movies_users_selection = helper.sort_by_rating_density(user_movie_ratings, n_movies, n_users)

print('dataset dimensions: ', most_rated_movies_users_selection.shape)
most_rated_movies_users_selection.head()

dataset dimensions:  (18, 30)

title	Forrest Gump (1994)	Pulp Fiction (1994)	Shawshank Redemption, The (1994)	Silence of the Lambs, The (1991)	Star Wars: Episode IV - A New Hope (1977)	Jurassic Park (1993)	Matrix, The (1999)	Toy Story (1995)	Schindler's List (1993)	Terminator 2: Judgment Day (1991)	...	Dances with Wolves (1990)	Fight Club (1999)	Usual Suspects, The (1995)	Seven (a.k.a. Se7en) (1995)	Lion King, The (1994)	Godfather, The (1972)	Lord of the Rings: The Fellowship of the Ring, The (2001)	Apollo 13 (1995)	True Lies (1994)	Twelve Monkeys (a.k.a. 12 Monkeys) (1995)
29	5.0	5.0	5.0	4.0	4.0	4.0	3.0	4.0	5.0	4.0	...	5.0	4.0	5.0	4.0	3.0	5.0	3.0	5.0	4.0	2.0
508	4.0	5.0	4.0	4.0	5.0	3.0	4.5	3.0	5.0	2.0	...	5.0	4.0	5.0	4.0	3.5	5.0	4.5	3.0	2.0	4.0
14	1.0	5.0	2.0	5.0	5.0	3.0	5.0	2.0	4.0	4.0	...	3.0	5.0	5.0	5.0	4.0	5.0	5.0	3.0	4.0	4.0
72	5.0	5.0	5.0	4.5	4.5	4.0	4.5	5.0	5.0	3.0	...	4.5	5.0	5.0	5.0	5.0	5.0	5.0	3.5	3.0	5.0
653	4.0	5.0	5.0	4.5	5.0	4.5	5.0	5.0	5.0	5.0	...	4.5	5.0	5.0	4.5	5.0	4.5	5.0	5.0	4.0	5.0

5 rows × 30 columns
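The selection above relies on helper.sort_by_rating_density, whose source is not shown here. A hedged sketch of one way to get a similar "dense corner" follows. The function name densest_block and the toy table are hypothetical, and the real helper may order or break ties differently.

```python
import numpy as np
import pandas as pd

def densest_block(user_movie_ratings, n_movies, n_users):
    """Keep the n_movies most-rated columns, then the n_users rows with most ratings."""
    # Count ratings per movie and keep the most-rated columns
    movie_counts = user_movie_ratings.notna().sum(axis=0)
    top_movies = movie_counts.sort_values(ascending=False).index[:n_movies]
    block = user_movie_ratings[top_movies]
    # Within those columns, count ratings per user and keep the busiest rows
    user_counts = block.notna().sum(axis=1)
    top_users = user_counts.sort_values(ascending=False).index[:n_users]
    return block.loc[top_users]

toy = pd.DataFrame(
    [[5.0, np.nan, 4.0, np.nan],
     [4.0, np.nan, np.nan, np.nan],
     [3.0, 2.0, 5.0, np.nan],
     [np.nan, np.nan, 1.0, np.nan]],
    index=pd.Index([10, 11, 12, 13], name='userId'),
    columns=['Movie A', 'Movie B', 'Movie C', 'Movie D'])

toy_block = densest_block(toy, n_movies=2, n_users=2)
print(toy_block)
```

In the toy table, Movie A and Movie C have the most ratings, and users 10 and 12 rated both of them, so the resulting 2 × 2 block has no missing cells.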

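That's easier to analyze. We also need a good way to visualize these ratings, so that we can recognize them (and later the clusters) at a glance when we look at bigger subsets.

Let's use colors instead of the rating numbers: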

helper.draw_movies_heatmap(most_rated_movies_users_selection)

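[Figure: heatmap of the ratings of the 30 most-rated movies by 18 users]

Each column is a movie. Each row is a user. The color of a cell shows how that user rated that movie, according to the scale on the right of the plot.

Notice how some cells are white? That means the user did not rate that movie. This is an issue you'll run into when clustering in real life. Unlike the clean example we started with, real-world datasets are often sparse and do not have a value in every cell. That makes it less straightforward to cluster users directly by their movie ratings, as k-means generally does not like missing values.

For performance reasons, we'll only use the ratings of 1000 movies (out of the 9000+ in the dataset).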

user_movie_ratings =  pd.pivot_table(ratings_title, index='userId', columns= 'title', values='rating')
most_rated_movies_1k = helper.get_most_rated_movies(user_movie_ratings, 1000)

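To have sklearn run k-means clustering on a dataset with missing values like this, we first need to cast it to the sparse csr matrix type (as defined in the SciPy library).

To convert from a pandas dataframe to a sparse matrix, we first convert to a SparseDataFrame and then use pandas' to_coo() method for the conversion.

Note: to_coo() was only available in the newer pandas versions at the time of writing. If you run into an error in the next cell, check your pandas version. Be aware that SparseDataFrame itself was removed in pandas 1.0, so the next cell only runs on older releases.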

sparse_ratings = csr_matrix(pd.SparseDataFrame(most_rated_movies_1k).to_coo())
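pd.SparseDataFrame is not available in pandas 1.0 and later. One possible substitute, shown here as an assumption and not as the original notebook's code, is to fill the missing ratings with 0 and hand the dense array to csr_matrix, which then stores only the non-zero entries. The toy table and variable names below are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

toy_ratings_wide = pd.DataFrame(
    [[5.0, np.nan, 4.0],
     [np.nan, np.nan, 3.5],
     [2.0, np.nan, np.nan]],
    columns=['Movie A', 'Movie B', 'Movie C'])

# Missing ratings become 0; csr_matrix then stores only the non-zero entries
toy_sparse = csr_matrix(toy_ratings_wide.fillna(0).to_numpy())

print(toy_sparse.shape, toy_sparse.nnz)
print(toy_sparse.toarray())
```

Either way, the entries that a sparse matrix does not store are read as 0, so k-means effectively treats "not rated" as a rating of 0. This is a modelling shortcut worth keeping in mind when interpreting the clusters.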

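Let's cluster!

With k-means, we have to specify k, the number of clusters. Let's arbitrarily try k=20 (a better way to pick k is the elbow method illustrated above, but it takes some time to run):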

# 20 clusters
# (scikit-learn 1.1 renamed algorithm='full' to 'lloyd'; newer releases only accept the new name)
predictions = KMeans(n_clusters=20, algorithm='full').fit_predict(sparse_ratings)

To visualize some of these clusters, we'll plot each cluster as a heatmap:

max_users = 70
max_movies = 50

clustered = pd.concat([most_rated_movies_1k.reset_index(), pd.DataFrame({'group':predictions})], axis=1)
helper.draw_movie_clusters(clustered, max_users, max_movies)

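cluster # 7
# of users in cluster: 276. # of users in plot: 70
[Figure: ratings heatmap for cluster 7]

cluster # 16
# of users in cluster: 64. # of users in plot: 64
[Figure: ratings heatmap for cluster 16]

cluster # 0
# of users in cluster: 26. # of users in plot: 26
[Figure: ratings heatmap for cluster 0]

cluster # 2
# of users in cluster: 72. # of users in plot: 70
[Figure: ratings heatmap for cluster 2]

cluster # 6
# of users in cluster: 17. # of users in plot: 17
[Figure: ratings heatmap for cluster 6]

cluster # 3
# of users in cluster: 37. # of users in plot: 37
[Figure: ratings heatmap for cluster 3]

cluster # 11
# of users in cluster: 12. # of users in plot: 12
[Figure: ratings heatmap for cluster 11]

cluster # 18
# of users in cluster: 35. # of users in plot: 35
[Figure: ratings heatmap for cluster 18]

cluster # 9
# of users in cluster: 55. # of users in plot: 55
[Figure: ratings heatmap for cluster 9]

cluster # 8
# of users in cluster: 27. # of users in plot: 27
[Figure: ratings heatmap for cluster 8]

cluster # 15
# of users in cluster: 15. # of users in plot: 15
[Figure: ratings heatmap for cluster 15]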

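There are several things to note here:

The more similar the ratings in a cluster are, the more vertical lines of similar color you'll be able to trace in that cluster.
It's very interesting to spot trends in the clusters:
Some clusters are more sparse than others; their users probably watch and rate fewer movies than users in other clusters.
Some clusters are mostly yellow and bring together people who really love a certain group of movies. Other clusters are mostly green or navy blue, meaning their users agree that a certain set of movies deserves 2-3 stars.
Note how the movies change from cluster to cluster. The plot filters the data to show only the most-rated movies and then sorts them by average rating.
Can you track where the Lord of the Rings movies appear in each cluster? What about the Star Wars movies?
It's easy to spot horizontal lines of similar color; these are users without much variety in their ratings. This is likely one of the reasons Netflix switched from star-based ratings to thumbs-up/thumbs-down ratings. A rating of four stars means different things to different people.
We did a few things (filtering/sorting/slicing) to make the clusters visible. That is because datasets like this are "sparse" and most cells have no value (because most people have not watched most movies).

Prediction

Let's pick a cluster and a specific user and see what useful things this clustering allows us to do.

Let's first pick a cluster: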

# TODO: Pick a cluster ID from the clusters above
cluster_number = 11

# Let's filter to only see the region of the dataset with the most number of values 
n_users = 75
n_movies = 300
cluster = clustered[clustered.group == cluster_number].drop(['index', 'group'], axis=1)

cluster = helper.sort_by_rating_density(cluster, n_movies, n_users)
helper.draw_movies_heatmap(cluster, axis_labels=False)

[Figure: heatmap of the ratings in the selected cluster]

The actual ratings in the cluster look like this:

cluster.fillna('').head()

Amadeus (1984)	Annie Hall (1977)	One Flew Over the Cuckoo's Nest (1975)	Fargo (1996)	Cool Hand Luke (1967)	Chinatown (1974)	North by Northwest (1959)	Citizen Kane (1941)	Wizard of Oz, The (1939)	Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)	...	Sense and Sensibility (1995)	Top Gun (1986)	Flashdance (1983)	Jerry Maguire (1996)	Superman (1978)	Abyss, The (1989)	Devil in a Blue Dress (1995)
0	5.0	4.0	4.0	5	4	4	4	5	4	...	3	3	3	4	3
1	4.0	4.0	4.0	4	5	5	3	5	4	3	...	2	3	2	4	2	3
2	5.0	4.0	5.0	5	5	5	5	5	5	5	...	3	4	5	4
8	2.0	5.0	2.0	5	3	5	3	4	5	3	...	4.5	2	4	3	3	3
10	3.0	4.0	3.0	4	5	4	4	4	5	...	5	4	3	2

5 rows × 300 columns

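Pick a blank cell from the table. It's blank because the user did not rate that movie. Can we predict whether she would like it? Since the user is in a cluster of users that seem to have similar taste, we can take the average rating of that movie within the cluster, and that is a reasonable prediction of how much she would enjoy it.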

# TODO: Fill in the name of the column/movie. e.g. 'Forrest Gump (1994)'
# Pick a movie from the table above since we're looking at a subset
movie_name = 'Forrest Gump (1994)'

cluster[movie_name].mean()

3.6666666666666665

And this is our prediction of how she would rate the movie.
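One detail worth spelling out: pandas' mean() skips NaN cells by default, so the prediction is the average over only those cluster members who actually rated the movie. The sketch below wraps that in a small function that also handles the case where nobody in the cluster rated the movie. The name predict_rating and the toy cluster are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def predict_rating(cluster_ratings, movie_name):
    """Cluster average for one movie; None if nobody in the cluster rated it."""
    average = cluster_ratings[movie_name].mean()   # NaN cells are skipped by default
    return None if pd.isna(average) else float(average)

toy_cluster = pd.DataFrame(
    {'Movie A': [4.0, np.nan, 5.0],
     'Movie B': [np.nan, np.nan, np.nan]},
    index=pd.Index([0, 1, 2], name='userId'))

print(predict_rating(toy_cluster, 'Movie A'))   # user 1 has no rating for Movie A
print(predict_rating(toy_cluster, 'Movie B'))   # nobody rated Movie B
```

For Movie A the prediction for user 1 is the mean of the two available ratings, 4.5. For Movie B there is nothing to average, so the function returns None instead of a misleading number.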

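Recommendation

Let's recap what we did in the previous step. We used k-means to cluster users according to their ratings. This gave us clusters of users with similar ratings, and thus generally similar taste in movies. Based on this, when a user did not have a rating for a certain movie, we averaged the ratings of all the other users in the cluster, and that was our guess of how much this user would like the movie.

Using this logic, if we calculate the average score of every movie in this cluster, we get an understanding of how this "taste cluster" feels about each movie in the dataset.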

# The average rating of 20 movies as rated by the users in the cluster
cluster.mean().head(20)

Amadeus (1984)                                                                    3.833333
Annie Hall (1977)                                                                 4.291667
One Flew Over the Cuckoo's Nest (1975)                                            4.208333
Fargo (1996)                                                                      4.454545
Cool Hand Luke (1967)                                                             4.636364
Chinatown (1974)                                                                  4.454545
North by Northwest (1959)                                                         4.409091
Citizen Kane (1941)                                                               4.681818
Wizard of Oz, The (1939)                                                          4.500000
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    4.272727
Butch Cassidy and the Sundance Kid (1969)                                         4.045455
Star Wars: Episode V - The Empire Strikes Back (1980)                             4.090909
Groundhog Day (1993)                                                              3.727273
Gone with the Wind (1939)                                                         4.272727
It's a Wonderful Life (1946)                                                      4.272727
2001: A Space Odyssey (1968)                                                      4.272727
Shawshank Redemption, The (1994)                                                  4.363636
Philadelphia Story, The (1940)                                                    4.409091
Bonnie and Clyde (1967)                                                           4.150000
To Kill a Mockingbird (1962)                                                      4.400000
dtype: float64

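This becomes really useful to us, because we can now use it as a recommendation engine that helps users discover movies they're likely to enjoy.

When a user logs in to our app, we can now show them movies that fit their taste. The recommendation is simply to pick the highest-rated movies in the cluster that the user has not rated yet.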

# TODO: Pick a user ID from the dataset
# Look at the table above outputted by the command "cluster.fillna('').head()" 
# and pick one of the user ids (the first column in the table)
user_id = 11

# Get all this user's ratings
user_2_ratings  = cluster.loc[user_id, :]

# Which movies did they not rate? (We don't want to recommend movies they've already rated)
user_2_unrated_movies =  user_2_ratings[user_2_ratings.isnull()]

# What are the ratings of these movies the user did not rate?
avg_ratings = pd.concat([user_2_unrated_movies, cluster.mean()], axis=1, join='inner').loc[:,0]

# Let's sort by rating so the highest rated movies are presented first
avg_ratings.sort_values(ascending=False)[:20]

Remains of the Day, The (1993)    4.666667
Saving Private Ryan (1998)        4.642857
African Queen, The (1951)         4.625000
Lone Star (1996)                  4.600000
Godfather: Part II, The (1974)    4.500000
Singin' in the Rain (1952)        4.500000
My Cousin Vinny (1992)            4.500000
Raising Arizona (1987)            4.500000
Fargo (1996)                      4.454545
Rain Man (1988)                   4.400000
Full Metal Jacket (1987)          4.400000
Sense and Sensibility (1995)      4.375000
Fried Green Tomatoes (1991)       4.333333
Room with a View, A (1986)        4.300000
It's a Wonderful Life (1946)      4.272727
Dial M for Murder (1954)          4.250000
Laura (1944)                      4.250000
American Graffiti (1973)          4.250000
Much Ado About
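The recommendation steps above can be collected into one small function. This is a sketch under stated assumptions: the name recommend_for_user and the toy cluster are hypothetical, and the function indexes the cluster means directly by the unrated titles in place of the concat-based selection used in the cell above.

```python
import numpy as np
import pandas as pd

def recommend_for_user(cluster_ratings, user_id, n=20):
    """Top-n movies the user has not rated, ranked by the cluster's average rating."""
    user_ratings = cluster_ratings.loc[user_id]
    unrated_titles = user_ratings[user_ratings.isnull()].index
    cluster_means = cluster_ratings.mean()
    # dropna() removes movies that nobody in the cluster has rated
    return cluster_means[unrated_titles].dropna().sort_values(ascending=False).head(n)

toy_cluster = pd.DataFrame(
    {'Movie A': [5.0, 4.0, np.nan],
     'Movie B': [3.0, np.nan, 4.0],
     'Movie C': [np.nan, 4.5, 5.0],
     'Movie D': [2.0, np.nan, 1.0]},
    index=pd.Index([0, 1, 2], name='userId'))

toy_recommendations = recommend_for_user(toy_cluster, user_id=1, n=2)
print(toy_recommendations)
```

User 1 has not rated Movie B or Movie D. Within the cluster their averages are 3.5 and 1.5, so Movie B is recommended first.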

 
 
              
           
              
              
            
