
Recommendation Algorithms: Movie Recommendation


Implementing two kinds of recommendation algorithms

1. Neighbourhood-based methods (collaborative filtering): user-based and item-based.

2. Latent-factor methods (matrix factorization): SVD.
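To make the second approach concrete, here is a minimal pure-Python sketch of a biased matrix factorization trained by stochastic gradient descent, using the prediction rule r_hat(u, i) = mu + b_u + b_i + p_u . q_i. It is an illustration only, not Surprise's implementation: the function names `train_mf` and `rmse`, the hyperparameter values, and the toy ratings are all assumptions made for this example.

```python
import math
import random


def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat(u, i) = mu + b_u + b_i + p_u . q_i with plain SGD.

    ratings is a list of (user, item, rating) triples. Returns a predict(u, i)
    function that works for the users and items seen during training.
    """
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    b_u = {u: 0.0 for u in users}  # user biases
    b_i = {i: 0.0 for i in items}  # item biases
    # Latent factors start as small random values
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + b_u[u] + b_i[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Gradient step with L2 regularization on every parameter
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return predict


def rmse(predict, ratings):
    """Root-mean-square error of predict over (user, item, rating) triples."""
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))


# Toy ratings, invented for illustration (not taken from MovieLens)
toy_ratings = [("u1", "m1", 5), ("u1", "m2", 3), ("u2", "m1", 4),
               ("u2", "m3", 1), ("u3", "m2", 2), ("u3", "m3", 5)]
toy_mean = sum(r for _, _, r in toy_ratings) / len(toy_ratings)

model = train_mf(toy_ratings)
print("mean-only RMSE:", rmse(lambda u, i: toy_mean, toy_ratings))
print("factorized RMSE:", rmse(model, toy_ratings))
```

The comparison against a constant "predict the global mean" baseline shows what the biases and latent factors add on the training data; it says nothing about generalization, which needs a held-out split.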

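Both are implemented with the Python recommender-system library surprise.

surprise is one of the scikits. It is simple to use and supports several families of recommendation algorithms: baseline algorithms, collaborative filtering, and matrix factorization (latent factor models).

surprise documentation: https://surprise.readthedocs.io/en/stable/getting_started.html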

import os, io, collections
import pandas as pd
from surprise import Dataset, KNNBaseline, SVD, accuracy, Reader
from surprise.model_selection import cross_validate, train_test_split

# Collaborative filtering
# Load movielens-100k, a classic public recommender-system dataset;
# a prompt asks whether to download it.
data = Dataset.load_builtin('ml-100k')

# Or load a local copy of the dataset
# Path to the dataset file
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')
# Reader describes the text format: line_format names the fields (columns), sep is the separator
reader = Reader(line_format='user item rating timestamp', sep='\t')
# Load the dataset
data = Dataset.load_from_file(file_path, reader=reader)

data_df = pd.read_csv(file_path, sep='\t', header=None,
                      names=['user', 'item', 'rating', 'timestamp'])
item_df = pd.read_csv(os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.item'),
                      sep='|', encoding='ISO-8859-1', header=None,
                      names=['mid', 'mtitle'] + [x for x in range(22)])
# Convert every column to strings, matching the string raw ids surprise reads from file
data_df = data_df.astype(str)
item_df = item_df.astype(str)
# Map movie id to movie title
item_dict = {item_df.loc[x, 'mid']: item_df.loc[x, 'mtitle'] for x in range(len(item_df))}

About the dataset: it was collected from the movie website movielens.umn.edu over seven months, from 1997-9-19 to 1998-4-22.

Inspecting the dataset

root@c:~$ cd ~/.surprise_data/ml-100k/ml-100k
root@c:ml-100k$ ls
allbut.pl  u1.base  u2.test  u4.base  u5.test  ub.base  u.genre  u.occupation
mku.sh     u1.test  u3.base  u4.test  ua.base  ub.test  u.info   u.user
README     u2.base  u3.test  u5.base  ua.test  u.data   u.item

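The most important files are u.data and u.item.

u.data contains 100,000 movie ratings from 943 users on 1,682 movies; every user rated at least 20 movies. The columns are user id, movie id, rating, and timestamp.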

root@c:ml-100k$ sed -n 1,5p u.data
196     242     3       881250949
186     302     3       891717742
22      377     1       878887116
244     51      2       880606923
166     346     1       886397596
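As a small illustration of this layout, the sketch below parses tab-separated u.data lines into named records without pandas. The `Rating` tuple and `parse_rating_line` helper are hypothetical names introduced here; the two sample lines are the first two rows shown above. Ids are kept as strings on purpose, since surprise treats raw ids read from a file as strings.

```python
from collections import namedtuple

# One record per u.data line: user id, movie id, rating, timestamp
Rating = namedtuple("Rating", ["user", "item", "rating", "timestamp"])


def parse_rating_line(line):
    """Split one tab-separated u.data line into a Rating record."""
    user, item, rating, timestamp = line.rstrip("\n").split("\t")
    # Keep ids as strings; convert only the numeric fields
    return Rating(user, item, float(rating), int(timestamp))


sample = "196\t242\t3\t881250949\n186\t302\t3\t891717742\n"
ratings = [parse_rating_line(line) for line in sample.splitlines()]
for record in ratings:
    print(record)
```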

u.item contains the details of each movie; the first two columns are the movie id and the movie title.

root@c:ml-100k$ sed -n 1,5p u.item
1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0
2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
3|Four Rooms (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
4|Get Shorty (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)|0|1|0|0|0|1|0|0|1|0|0|0|0|0|0|0|0|0|0
5|Copycat (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Copycat%20(1995)|0|0|0|0|0|0|1|0|1|0|0|0|0|0|0|0|1|0|0
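If only the id-to-title mapping is needed, the file can also be read line by line. The sketch below assumes the pipe-separated layout and the ISO-8859-1 encoding described above; `load_titles` is a hypothetical helper, and the demo writes the first two sample rows to a temporary file rather than touching the real dataset path.

```python
import os
import tempfile


def load_titles(path):
    """Return {movie id: title} from a pipe-separated u.item style file."""
    titles = {}
    with open(path, encoding="ISO-8859-1") as f:
        for line in f:
            # Only the first two fields matter here; the rest stays unsplit
            movie_id, title = line.rstrip("\n").split("|", 2)[:2]
            titles[movie_id] = title
    return titles


sample_rows = [
    "1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0",
    "2|GoldenEye (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?GoldenEye%20(1995)|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0",
]

# Write the sample to a temporary file so the sketch runs without the dataset
with tempfile.NamedTemporaryFile("w", suffix=".item", encoding="ISO-8859-1",
                                 delete=False) as tmp:
    tmp.write("\n".join(sample_rows) + "\n")
    tmp_path = tmp.name

titles = load_titles(tmp_path)
os.remove(tmp_path)
print(titles)
```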

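User-based collaborative filtering:

# Similarity configuration for the collaborative filtering algorithms
# user-based
user_based_sim_option = {'name': 'pearson_baseline', 'user_based': True}
# item-based
item_based_sim_option = {'name': 'pearson_baseline', 'user_based': False}

# Recommend n movies to a user with user-based collaborative filtering:
# find the 10 most similar users, then add the movies they rated highly to the list.
def get_similar_users_recommendations(uid, n=10):
    # Raw ids read from file are strings, so normalise the argument
    uid = str(uid)
    # Build the training set; here it is the whole dataset
    trainset = data.build_full_trainset()
    # KNN collaborative filtering that takes baseline ratings into account
    # (note the plural: the keyword argument is sim_options)
    algo = KNNBaseline(sim_options=user_based_sim_option)
    # Fit on the training set
    algo.fit(trainset)
    # Convert the raw id to an inner id
    inner_id = algo.trainset.to_inner_uid(uid)
    # get_neighbors returns the 10 most similar users
    neighbors = algo.get_neighbors(inner_id, k=10)
    neighbors_uid = (algo.trainset.to_raw_uid(x) for x in neighbors)
    recommendations = set()
    # Add the movies rated 5 to the recommendation set
    for user in neighbors_uid:
        if len(recommendations) > n:
            break
        item = data_df[data_df['user'] == user]
        # data_df was cast to str, so the rating is compared as '5'
        item = item[item['rating'] == '5']['item']
        for i in item:
            recommendations.add(item_dict[i])
    print('\nrecommendations for user %s:' % uid)
    for i, j in enumerate(list(recommendations)):
        if i >= n:
            break
        print(j)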

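Recommending 10 movies to the user with id 1. The log below is the author's recorded run; it reports the msd similarity, which is surprise's default, rather than pearson_baseline, so the configured similarity was evidently not in effect for this run. With pearson_baseline applied, the list will generally differ.

In []: get_similar_users_recommendations(1, 10)
Out[]: Estimating biases using als...
       Computing the msd similarity matrix...
       Done computing similarity matrix.

       recommendations for user %s:
       Lawrence of Arabia (1962)
       Full Monty, The (1997)
       Winter Guest, The (1997)
       Air Force One (1997)
       Hoop Dreams (1994)
       Game, The (1997)
       English Patient, The (1996)
       Mrs. Brown (Her Majesty, Mrs. Brown) (1997)
       Contact (1997)
       Liar Liar (1997)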
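The pooling step in the function above (walk through the neighbours, collect the movies they rated 5) can be looked at in isolation. The following is a sketch under stated assumptions, not the author's code: `pool_neighbor_favorites` and its `exclude` parameter are hypothetical, the ratings are toy data, and an ordered list replaces the set so that the result is deterministic and capped at exactly n items.

```python
def pool_neighbor_favorites(neighbors, ratings_by_user, n=10, top_rating=5, exclude=()):
    """Collect up to n items that the given neighbours rated top_rating.

    neighbors is ordered from most to least similar; ratings_by_user maps
    user -> {item: rating}. Items listed in exclude (for example, movies the
    target user has already rated) are skipped.
    """
    seen = set(exclude)
    recommended = []
    for user in neighbors:
        for item, rating in ratings_by_user.get(user, {}).items():
            if rating == top_rating and item not in seen:
                seen.add(item)
                recommended.append(item)
                if len(recommended) == n:
                    return recommended
    return recommended


# Toy data, invented for illustration
toy_ratings_by_user = {
    "u2": {"m1": 5, "m2": 4, "m3": 5},
    "u3": {"m3": 5, "m4": 5},
}
print(pool_neighbor_favorites(["u2", "u3"], toy_ratings_by_user, n=3))
print(pool_neighbor_favorites(["u2", "u3"], toy_ratings_by_user, n=3, exclude={"m1"}))
```

Because closer neighbours are visited first, their favourites fill the list first. Filtering out movies the target user has already rated is an extension that the original function does not perform.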

Item-based collaborative filtering:

# The n movies most similar to a given movie, using item-based collaborative filtering.
def get_similar_items(iid, n=10):
    # Raw ids read from file are strings, so normalise the argument
    iid = str(iid)
    trainset = data.build_full_trainset()
    # Note the plural: the keyword argument is sim_options
    algo = KNNBaseline(sim_options=item_based_sim_option)
    algo.fit(trainset)
    inner_id = algo.trainset.to_inner_iid(iid)
    # get_neighbors returns the n most similar movies
    neighbors = algo.get_neighbors(inner_id, k=n)
    neighbors_iid = (algo.trainset.to_raw_iid(x) for x in neighbors)
    recommendations = [item_dict[x] for x in neighbors_iid]
    print('\nten movies most similar to the %s:' % item_dict[iid])
    for i in recommendations:
        print(i)

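The ten movies most similar to the movie with id 2 (GoldenEye (1995)). This is again the author's recorded run, and the log again reports the default msd similarity rather than pearson_baseline. This suggests the run used surprise's default options, which are also user-based, so the list should be read with caution.

In []: get_similar_items(2)
Out[]: Estimating biases using als...
       Computing the msd similarity matrix...
       Done computing similarity matrix.

       ten movies most similar to the GoldenEye (1995):
       Evil Dead II (1987)
       Hoop Dreams (1994)
       Speed (1994)
       Grand Day Out, A (1992)
       Ed Wood (1994)
       Adventures of Pinocchio, The (1996)
       Highlander (1986)
       Unforgiven (1992)
       Down Periscope (1996)
       Bullets Over Broadway (1994)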
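Both neighbourhood methods rest on a similarity measure between two rating profiles. As a rough illustration, here is the plain Pearson correlation computed over co-rated movies. This is a simplification: according to the surprise documentation, pearson_baseline centres ratings on baseline estimates instead of means and applies a shrinkage term, which this sketch does not do. The `pearson` function and the three toy profiles are assumptions made for the example.

```python
import math


def pearson(ratings_a, ratings_b):
    """Pearson correlation of two {item: rating} profiles over common items.

    Returns 0.0 when fewer than two items are shared or when either profile
    has no variance on the shared items.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    den = math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common)) * \
        math.sqrt(sum((ratings_b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0


# Toy profiles, invented for illustration
alice = {"m1": 5, "m2": 3, "m3": 1}
bob = {"m1": 4, "m2": 3, "m3": 2, "m4": 5}   # same taste, milder scale
carol = {"m1": 1, "m2": 3, "m3": 5}          # opposite taste

print(pearson(alice, bob))
print(pearson(alice, carol))
```

The same function serves the item-based case if each profile maps users to the ratings a movie received, which is the switch that the user_based flag in the similarity options controls.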

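Matrix factorization with SVD:

# SVD: predict every user's rating for every movie they have not rated,
# and store each user's n highest-predicted movies in a dict.
def get_recommendations_dict(n=10):
    trainset = data.build_full_trainset()
    # Test set: every (user, item) pair that has no rating
    testset = trainset.build_anti_testset()
    # Use the SVD algorithm
    algo = SVD()
    algo.fit(trainset)
    # Predict
    predictions = algo.test(testset)
    # Root-mean-square error. The anti-testset has no true ratings (surprise
    # fills r_ui with the global mean), so this is not a real accuracy measure.
    print("RMSE: %s" % accuracy.rmse(predictions))

    # Dict holding each user's n highest-predicted movies
    user_recommendations = collections.defaultdict(list)
    for uid, iid, r_ui, est, details in predictions:
        user_recommendations[uid].append((iid, est))
    for uid, user_ratings in user_recommendations.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        user_recommendations[uid] = user_ratings[:n]
    return user_recommendations

# Get the 10 highest-predicted movies for every user
user_recommendations = get_recommendations_dict(10)

# Print the titles of the movies recommended to a user
def rec_for_user(uid):
    # Raw ids read from file are strings, so normalise the argument
    uid = str(uid)
    print("recommendations for user %s:" % uid)
    #[ item_dict[x[0]] for x in user_recommendations[uid] ]
    for i in user_recommendations[uid]:
        print(item_dict[i[0]])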

Recommending 10 movies to the user with id 1:

In []: rec_for_user(1)
Out[]: recommendations for user 1:
       L.A. Confidential (1997)
       Secrets & Lies (1996)
       Ran (1985)
       Lawrence of Arabia (1962)
       One Flew Over the Cuckoo's Nest (1975)
       Raise the Red Lantern (1991)
       In the Name of the Father (1993)
       City of Lost Children, The (1995)
       Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)
       Faust (1994)
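The top-n selection used above sorts every user's full list of predictions. With roughly 1,682 candidate movies per user, a heap-based selection is a possible alternative. The sketch below is an assumption-laden illustration: `top_n_per_user` is a hypothetical helper, and the predictions are simplified to (uid, iid, est) triples with invented values, whereas surprise's prediction objects carry five fields.

```python
import heapq
from collections import defaultdict


def top_n_per_user(predictions, n=10):
    """Group (uid, iid, est) triples by user and keep the n highest estimates."""
    by_user = defaultdict(list)
    for uid, iid, est in predictions:
        by_user[uid].append((iid, est))
    # heapq.nlargest returns the n best entries, highest estimate first
    return {uid: heapq.nlargest(n, items, key=lambda pair: pair[1])
            for uid, items in by_user.items()}


# Toy predictions, invented for illustration
toy_predictions = [("1", "a", 3.2), ("1", "b", 4.8), ("1", "c", 4.1), ("2", "a", 2.0)]
top = top_n_per_user(toy_predictions, n=2)
print(top)
```

For small n, this avoids fully sorting each user's list; the result has the same content as sort-then-slice when estimates are distinct.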
