
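Recommender System Case Study

Abstract

This article walks through the following recommendation algorithms and how each one was tuned:

1. Baseline algorithm

2. Item-based collaborative filtering

3. Item-based collaborative filtering combined with the baseline

4. Item-based collaborative filtering (top-K + baseline)

Movie dataset URL: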

http://files.grouplens.org/datasets/movielens/ml-100k.zip

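Baseline algorithm

The main idea of the baseline algorithm: fill the NaN entries of the user rating matrix with item_mean[item] + user_mean[user] - all_mean to predict a user's rating for an item they have not rated. Here item_mean is the mean rating an item received from the users who rated it, user_mean is the mean of all ratings given by the specified user, and all_mean is the mean of all observed ratings.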
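To make the formula concrete, here is a minimal sketch on a hypothetical 3x3 toy matrix (the `toy_*` names and the values are made up for illustration and are not part of the MovieLens data). A 0 marks a missing rating, matching the convention used later in this article.

```python
import numpy as np

# Hypothetical 3x3 toy matrix (users x items); 0 marks a missing rating.
toy = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 2.0],
    [0.0, 1.0, 3.0],
])
mask = toy != 0

toy_all_mean = toy[mask].mean()                      # mean of all observed ratings
toy_user_mean = toy.sum(axis=1) / mask.sum(axis=1)   # per-user mean over rated items
toy_item_mean = toy.sum(axis=0) / mask.sum(axis=0)   # per-item mean over its raters

# Baseline estimate for every (user, item) pair via broadcasting.
toy_baseline = (toy_item_mean[np.newaxis, :]
                + toy_user_mean[:, np.newaxis]
                - toy_all_mean)

# Keep observed ratings and fill only the missing cells.
filled = np.where(mask, toy, toy_baseline)
print(np.round(filled, 2))
```

For example, user 0 has a mean of 4.0, item 2 has a mean of 2.5, and the global mean is 3.0, so the missing cell (0, 2) is filled with 2.5 + 4.0 - 3.0 = 3.5.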

First, look at the structure of the data: [user_id, movie_id, rating, timestamp]

1	1	5	874965758
1	2	3	876893171
1	3	4	878542960

Read the data with pandas

import numpy as np
import pandas as pd
title=['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u1.base",sep='\t',names = title)

Check the number of users and items (IDs start at 1, so the largest ID is used as the count)

print(np.max(df['user_id']), np.max(df['item_id']))
943 1682

Build the ratings matrix

ratings = np.zeros((943, 1682))
for row in df.itertuples():
    ratings[row[1]-1,row[2]-1] = row[3]
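The row-by-row loop above can also be written with NumPy integer-array indexing. The following is a sketch on a hypothetical three-row sample (the `sample` frame and the `small` matrix are illustrative names, not from the original) that uses the same 1-based to 0-based index shift.

```python
import numpy as np
import pandas as pd

# Hypothetical mini sample in the same [user_id, item_id, rating, timestamp] layout.
sample = pd.DataFrame(
    [[1, 1, 5, 874965758], [1, 2, 3, 876893171], [2, 3, 4, 878542960]],
    columns=['user_id', 'item_id', 'rating', 'timestamp'],
)
n_users = int(sample['user_id'].max())
n_items = int(sample['item_id'].max())

# IDs are 1-based, so subtract 1 to get 0-based row/column indices.
rows = sample['user_id'].to_numpy() - 1
cols = sample['item_id'].to_numpy() - 1

small = np.zeros((n_users, n_items))
small[rows, cols] = sample['rating'].to_numpy()   # one vectorized assignment
print(small)
```

Both approaches assume each (user, item) pair appears at most once. If a pair were duplicated, the later rating would overwrite the earlier one.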

Check the density of the ratings matrix

density = float(len(ratings.nonzero()[0]))
density /= (ratings.shape[0] * ratings.shape[1])
density *= 100
print('Training matrix density: {:4.2f}%'.format(density))
Training matrix density: 5.04%

As the output shows, the ratings matrix is very sparse: about 95% of its entries are empty.

To start the baseline algorithm, first compute item_mean, user_mean, and all_mean

rated = ratings != 0
all_mean = np.mean(ratings[rated])
# A user or item with no ratings gives 0/0 = NaN here (NumPy emits a RuntimeWarning)
user_mean = ratings.sum(axis=1) / rated.sum(axis=1)
item_mean = ratings.sum(axis=0) / rated.sum(axis=0)
# Fill any NaN in user_mean and item_mean with all_mean
user_mean = np.where(np.isnan(user_mean), all_mean, user_mean)
item_mean = np.where(np.isnan(item_mean), all_mean, item_mean)
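As an alternative to dividing first and patching NaN afterwards, `np.divide` accepts `out` and `where` arguments, so the fallback value can be filled in up front. The sketch below uses a hypothetical toy matrix whose middle item has no ratings; the `toy_*` names are illustrative only.

```python
import numpy as np

# Hypothetical toy matrix: the middle item has no ratings at all.
toy = np.array([
    [5.0, 0.0, 2.0],
    [3.0, 0.0, 0.0],
])
rated = toy != 0
toy_all_mean = toy[rated].mean()

sums = toy.sum(axis=0)
counts = rated.sum(axis=0)

# Pre-fill the output with the global mean, then divide only where counts > 0.
toy_item_mean = np.divide(
    sums, counts,
    out=np.full(sums.shape, toy_all_mean),
    where=counts > 0,
)
print(toy_item_mean)
```

This produces the same values as the divide-then-fill approach but avoids the division warning for empty rows or columns.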

Predict a user's rating for an item

def predict_naive(user, item):
    prediction = item_mean[item] + user_mean[user] - all_mean
    return prediction

Measure prediction accuracy with the root-mean-square error (RMSE)

from sklearn.metrics import mean_squared_error

def rmse(pred, actual):
    '''Compute the RMSE of the predictions over the non-zero entries of actual'''
    pred = pred[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    return np.sqrt(mean_squared_error(actual, pred))
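For readers without scikit-learn installed, the same quantity can be computed with NumPy alone. This is a sketch: `rmse_np` is a hypothetical helper name, and like the function above it only scores positions where the actual rating is non-zero.

```python
import numpy as np

def rmse_np(pred, actual):
    """RMSE over the positions where `actual` is non-zero (hypothetical helper)."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    observed = actual != 0                      # ignore entries with no true rating
    diff = pred[observed] - actual[observed]
    return float(np.sqrt(np.mean(diff ** 2)))

# Errors are -1, 0 and 2, so the result is sqrt((1 + 0 + 4) / 3).
print(rmse_np([3.0, 4.0, 5.0], [4.0, 4.0, 3.0]))
```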

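Evaluate the algorithm on the test set

# Evaluate on the test set. The original snippet did not define test_df,
# predictions, or actuals; the file below is assumed to be the same u3.test
# file that the later sections load.
test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
predictions = []
actuals = []
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_naive(user, item))
    actuals.append(actual)
print('Test RMSE: %.4f' % rmse(np.array(predictions), np.array(actuals)))

Test RMSE: 0.9344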

Item-based collaborative filtering

# Compute the user and item similarity matrices (cosine similarity)
user_s = ratings.dot(ratings.T)
item_s = ratings.T.dot(ratings)
user_norm = np.array([np.sqrt(np.diagonal(user_s))])
item_norm = np.array([np.sqrt(np.diagonal(item_s))])
user_sim = (user_s/user_norm/user_norm.T)
item_sim = (item_s/item_norm/item_norm.T)
print(np.round(item_sim[:10,:10], 3))

[[ 1.     0.296  0.279  0.388  0.252  0.114  0.518  0.41   0.416  0.199]
 [ 0.296  1.     0.177  0.405  0.211  0.099  0.331  0.31   0.207  0.152]
 [ 0.279  0.177  1.     0.275  0.118  0.104  0.311  0.125  0.207  0.121]
 [ 0.388  0.405  0.275  1.     0.265  0.091  0.411  0.391  0.357  0.219]
 [ 0.252  0.211  0.118  0.265  1.     0.016  0.28   0.214  0.202  0.031]
 [ 0.114  0.099  0.104  0.091  0.016  1.     0.128  0.065  0.164  0.139]
 [ 0.518  0.331  0.311  0.411  0.28   0.128  1.     0.342  0.43   0.279]
 [ 0.41   0.31   0.125  0.391  0.214  0.065  0.342  1.     0.364  0.166]
 [ 0.416  0.207  0.207  0.357  0.202  0.164  0.43   0.364  1.     0.25 ]
 [ 0.199  0.152  0.121  0.219  0.031  0.139  0.279  0.166  0.25   1.   ]]
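One detail worth checking in the similarity computation above: an item with no ratings in the training set has a zero norm, so its row and column become 0/0 = NaN. The sketch below shows one possible guard on a hypothetical toy matrix; `cosine_item_sim` and the `eps` constant are assumptions for illustration, not part of the original code.

```python
import numpy as np

def cosine_item_sim(r, eps=1e-9):
    """Item-item cosine similarity for a users x items matrix (hypothetical helper).

    `eps` is an assumed small constant that keeps all-zero item columns
    from producing 0/0 = NaN; such items get similarity 0 instead.
    """
    dot = r.T.dot(r)                                  # raw item-item dot products
    norm = np.sqrt(np.diagonal(dot))[np.newaxis, :]   # column norms, shape (1, n_items)
    return dot / np.maximum(norm * norm.T, eps)

toy = np.array([
    [5.0, 5.0, 0.0],
    [4.0, 4.0, 0.0],
    [0.0, 1.0, 0.0],   # the last item is never rated
])
sim = cosine_item_sim(toy)
print(np.round(sim, 3))
```

Items 0 and 1 have nearly identical rating columns, so their similarity is close to 1, while the unrated item 2 gets 0 everywhere rather than NaN.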

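Rating prediction function

def predict_itemCF(user, item, k=100):
    '''Item-based collaborative filtering: predict a rating (k is not used here)'''
    # Indices of the items this user has rated
    nzero = ratings[user].nonzero()[0]
    # Similarity-weighted average of the user's ratings
    prediction = ratings[user, nzero].dot(item_sim[item, nzero])\
                / sum(item_sim[item, nzero])
    return prediction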
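The prediction is a similarity-weighted average of the ratings the user has already given. The sketch below works through one case with made-up numbers; `predict_weighted` is a hypothetical standalone helper, and returning NaN when the weights sum to zero is an assumption rather than something the original code does.

```python
import numpy as np

def predict_weighted(user_ratings, sims_to_target):
    """Similarity-weighted mean of one user's observed ratings (hypothetical helper)."""
    rated = user_ratings.nonzero()[0]
    weights = sims_to_target[rated]
    total = weights.sum()
    if total == 0:
        # Assumption: signal "no usable neighbours" with NaN
        return float('nan')
    return float(user_ratings[rated].dot(weights) / total)

user_ratings = np.array([5.0, 0.0, 2.0, 0.0])   # this user rated items 0 and 2
sims = np.array([0.8, 1.0, 0.2, 0.5])           # similarity of each item to target item 1
print(predict_weighted(user_ratings, sims))
```

Here the weights are 0.8 and 0.2, so the prediction is (5 * 0.8 + 2 * 0.2) / (0.8 + 0.2), which is approximately 4.4: the rating of the more similar item dominates.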

Test the predictions

test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
test_df.head()
predictions = []
targets = []
print('Test set size: %d' % len(test_df))
print('Predicting with item-based collaborative filtering...')
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_itemCF(user, item))
    targets.append(actual)

print('Test RMSE: %.4f' % rmse(np.array(predictions), np.array(targets)))

Test set size: 20000
Predicting with item-based collaborative filtering...
Test RMSE: 0.9534

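Item-based collaborative filtering combined with the baseline

def predict_itemCF_baseline(user, item):
    '''Item-based CF combined with the baseline: predict a rating'''
    nzero = ratings[user].nonzero()[0]
    # Baseline estimate of this user's rating for every item
    baseline = item_mean + user_mean[user] - all_mean
    # Weighted average of the deviations from the baseline, added back to the target's baseline
    prediction = (ratings[user, nzero] - baseline[nzero]).dot(item_sim[item, nzero])\
                / sum(item_sim[item, nzero]) + baseline[item]
    return prediction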
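Written as a formula, the function above computes the following. The notation is shorthand introduced here, not the author's: $\mu$ stands for all_mean, $\bar r_u$ for user_mean[user], $\bar r_i$ for item_mean[item], $s_{ij}$ for item_sim[i, j], and $R(u)$ for the set of items that user $u$ has rated (nzero).

```latex
b_{ui} = \bar r_i + \bar r_u - \mu
\qquad
\hat r_{ui} = b_{ui} + \frac{\sum_{j \in R(u)} s_{ij}\,\bigl(r_{uj} - b_{uj}\bigr)}{\sum_{j \in R(u)} s_{ij}}
```

In words: instead of averaging raw ratings, the method averages how far each of the user's ratings deviates from its baseline, then adds that weighted deviation to the baseline of the target item.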

test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
test_df.head()
predictions = []
targets = []
print('Test set size: %d' % len(test_df))
print('Predicting with item-item collaborative filtering combined with the baseline...')
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_itemCF_baseline(user, item))
    targets.append(actual)

print('Test RMSE: %.4f' % rmse(np.array(predictions), np.array(targets)))

Test set size: 20000
Predicting with item-item collaborative filtering combined with the baseline...
Test RMSE: 0.8794

Correct out-of-range ratings: set predictions greater than 5 to 5 and predictions less than 1 to 1

def predict_itemCF_baseline(user, item, k=100):
    '''Item-based CF combined with the baseline: predict a rating (k is not used here)'''
    nzero = ratings[user].nonzero()[0]
    baseline = item_mean + user_mean[user] - all_mean
    prediction = (ratings[user, nzero] - baseline[nzero]).dot(item_sim[item, nzero])\
                / sum(item_sim[item, nzero]) + baseline[item]
    # Clamp the prediction to the valid rating range [1, 5]
    if prediction > 5:
        prediction = 5
    if prediction < 1:
        prediction = 1
    return prediction
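The two range checks can also be expressed with `np.clip`, which works on scalars as well as on whole arrays of predictions. A small sketch with made-up values:

```python
import numpy as np

# Hypothetical raw predictions, two of which fall outside the 1..5 rating scale.
raw = np.array([0.3, 2.7, 5.6])

# Values below 1 become 1, values above 5 become 5, the rest are unchanged.
clipped = np.clip(raw, 1, 5)
print(clipped)
```

Clipping the full prediction array once after the loop is an option if the per-call checks are not wanted inside the prediction function.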

test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
test_df.head()
predictions = []
targets = []
print('Test set size: %d' % len(test_df))
print('Predicting with item-item collaborative filtering combined with the baseline...')
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    # The original called an undefined predict_biasCF here
    predictions.append(predict_itemCF_baseline(user, item))
    targets.append(actual)

print('Test RMSE after clipping: %.4f' % rmse(np.array(predictions), np.array(targets)))

Test set size: 20000
Predicting with item-item collaborative filtering combined with the baseline...
Test RMSE after clipping: 0.8793

Item-based collaborative filtering (top-K + baseline)

print('------ Top-k collaborative filtering (item-based + baseline) ------')
def predict_topkCF(user, item, k=10):
    '''Top-k CF: item-based collaborative filtering combined with the baseline, predict a rating'''
    nzero = ratings[user].nonzero()[0]
    baseline = item_mean + user_mean[user] - all_mean
    # Keep only the k rated items most similar to the target item
    choice = nzero[item_sim[item, nzero].argsort()[::-1][:k]]
    prediction = (ratings[user, choice] - baseline[choice]).dot(item_sim[item, choice])\
                / sum(item_sim[item, choice]) + baseline[item]
    # Clamp the prediction to the valid rating range [1, 5]
    if prediction > 5: prediction = 5
    if prediction < 1: prediction = 1
    return prediction

print('Loading the test set...')
test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
test_df.head()
predictions = []
targets = []
print('Test set size: %d' % len(test_df))
print('Predicting with top-K collaborative filtering...')
k = 20
print('K is set to %d.' % k)
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_topkCF(user, item, k))
    targets.append(actual)
print('Test RMSE: %.4f' % rmse(np.array(predictions), np.array(targets)))

------ Top-k collaborative filtering (item-based + baseline) ------
Loading the test set...
Test set size: 20000
Predicting with top-K collaborative filtering...
K is set to 20.
Test RMSE: 0.7799
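
The neighbour selection step in predict_topkCF chains three operations: index the similarities of the rated items, sort them in descending order, and map the top positions back to item indices. The sketch below isolates that step with made-up numbers (the `sims`, `rated`, and `top_k` names are illustrative only).

```python
import numpy as np

sims = np.array([0.1, 0.9, 0.4, 0.7, 0.3])   # similarity of each item to the target item
rated = np.array([0, 2, 3, 4])               # indices of the items the user has rated
k = 2

# argsort is ascending, so reverse it and keep the first k positions.
order = sims[rated].argsort()[::-1][:k]

# `order` indexes into `rated`, so map it back to real item indices.
top_k = rated[order]
print(top_k)
```

Item 1 has the highest similarity overall but is skipped because the user has not rated it; the two neighbours kept are items 3 and 2, with similarities 0.7 and 0.4.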