A Recommender System Case Study
阿新 • Published: 2019-01-09
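Abstract
This article walks through the following recommendation algorithms and how they were tuned:
1. Baseline algorithm
2. Item-based collaborative filtering
3. Item-based collaborative filtering combined with the baseline
4. Item-based collaborative filtering (top-K + baseline)
Movie dataset: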
http://files.grouplens.org/datasets/movielens/ml-100k.zip
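Baseline algorithm
The baseline algorithm predicts a user's rating for an unseen item by filling the missing (NaN) entries of the user-item rating matrix with item_mean[item] + user_mean[user] - all_mean. Here item_mean[item] is the mean of all users' ratings of that item, user_mean[user] is the mean of all ratings given by that user, and all_mean is the mean of all ratings in the matrix.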
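To make the formula concrete, here is a minimal sketch on a made-up 3x3 matrix. The function name baseline_predict and the toy data are assumptions for illustration only; the article's own implementation follows further down.

```python
import numpy as np

def baseline_predict(matrix, user, item):
    """Return item_mean[item] + user_mean[user] - all_mean for a 0-filled rating matrix."""
    rated = matrix != 0                                  # True where a rating exists
    all_mean = matrix[rated].mean()                      # mean over every observed rating
    user_mean = matrix[user].sum() / rated[user].sum()   # mean of this user's ratings
    item_mean = matrix[:, item].sum() / rated[:, item].sum()  # mean of this item's ratings
    return item_mean + user_mean - all_mean

# Hypothetical toy data: rows are users, columns are items, 0 means "not rated"
toy = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 2.0],
    [0.0, 1.0, 5.0],
])

# User 0 has not rated item 2: item mean 3.5 + user mean 4.0 - global mean 20/6
pred = baseline_predict(toy, user=0, item=2)
print(round(pred, 4))
```

A user who rates above average pushes the prediction up, and an item rated above average does the same; subtracting all_mean keeps the global average from being counted twice.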
First, look at the structure of the data, [user_id, movie_id, rating, timestamp]:
1 1 5 874965758
1 2 3 876893171
1 3 4 878542960
Read the data with pandas
import numpy as np
import pandas as pd
title=['user_id', 'item_id', 'rating', 'timestamp']
df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u1.base",sep='\t',names = title)
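Check the number of distinct users and items (IDs are numbered consecutively from 1, so the largest ID is the count)
print(np.max(df['user_id']), np.max(df['item_id']))
943 1682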
Build the rating matrix ratings
ratings = np.zeros((943, 1682))
for row in df.itertuples():
    # row[0] is the DataFrame index; IDs are 1-based, matrix indices are 0-based
    ratings[row[1]-1, row[2]-1] = row[3]
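Check the density of the rating matrix
# Share of non-zero (rated) cells; the variable is named sparsity but holds the density
sparsity = float(len(ratings.nonzero()[0]))
sparsity /= (ratings.shape[0] * ratings.shape[1])
sparsity *= 100
print('Training matrix density: {:4.2f}%'.format(sparsity))
Training matrix density: 5.04%
The rating matrix is very sparse: about 95% of its entries are empty.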
Start the baseline algorithm by computing item_mean, user_mean, and all_mean
all_mean = np.mean(ratings[ratings != 0])
# A user or item with no training rating gives 0/0 = NaN (NumPy emits a RuntimeWarning)
user_mean = ratings.sum(axis=1) / (ratings != 0).sum(axis=1)
item_mean = ratings.sum(axis=0) / (ratings != 0).sum(axis=0)
# Replace any NaN in user_mean and item_mean with all_mean
user_mean = np.where(np.isnan(user_mean), all_mean, user_mean)
item_mean = np.where(np.isnan(item_mean), all_mean, item_mean)
Predict the rating of user for item
def predict_naive(user, item):
    prediction = item_mean[item] + user_mean[user] - all_mean
    return prediction
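Use the root mean squared error (RMSE) to measure accuracy
from sklearn.metrics import mean_squared_error
def rmse(pred, actual):
    '''Compute the RMSE of the predictions, using only entries whose actual rating is non-zero'''
    pred = pred[actual.nonzero()].flatten()
    actual = actual[actual.nonzero()].flatten()
    return np.sqrt(mean_squared_error(actual, pred))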
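For readers who want to see what the metric computes, the following is a small NumPy-only sketch of the same quantity. The name rmse_numpy and the sample numbers are assumptions for illustration, and unlike rmse above it does not filter out zero entries.

```python
import numpy as np

def rmse_numpy(pred, actual):
    """Square root of the mean squared difference between predictions and actual ratings."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((pred - actual) ** 2))

# Errors are -1, 0 and 2, so the mean squared error is (1 + 0 + 4) / 3
value = rmse_numpy([3, 4, 5], [4, 4, 3])
print(value)
```

Because the errors are squared before averaging, one prediction that is off by 2 stars costs as much as four predictions that are each off by 1.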
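Evaluate the algorithm on the test set
# Assumption: this loads the same test file as the later sections; the original snippet omits the step
test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
predictions = []
actuals = []
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_naive(user, item))
    actuals.append(actual)
print('Test RMSE: %.4f' % rmse(np.array(predictions), np.array(actuals)))
Test RMSE: 0.9344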
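An RMSE figure is only informative when the test ratings were not also used for training. According to the README shipped with ml-100k, u1.base/u1.test through u5.base/u5.test are paired 80%/20% splits whose test sets are disjoint, so a test file from a different fold than the training file (this article trains on u1.base and later loads u3.test) can share ratings with it. If a test pair also appears in training, the item-based predictors see the true rating among the user's rated items, which makes the RMSE optimistic. A minimal sketch of an overlap check follows; count_overlap and the toy pairs are hypothetical.

```python
def count_overlap(train_pairs, test_pairs):
    """Count the (user_id, item_id) pairs in test_pairs that also occur in train_pairs."""
    train_set = set(train_pairs)
    return sum(1 for pair in test_pairs if pair in train_set)

# Hypothetical toy pairs; with the article's DataFrames one could pass
# zip(df['user_id'], df['item_id']) and zip(test_df['user_id'], test_df['item_id'])
train_pairs = [(1, 1), (1, 2), (2, 1)]
test_pairs = [(1, 2), (2, 3)]

overlap = count_overlap(train_pairs, test_pairs)
print('overlapping pairs:', overlap)
```

A result of 0 means the test file is a genuine hold-out set for the training file in use.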
Item-based collaborative filtering
# Compute the user and item cosine-similarity matrices
user_s = ratings.dot(ratings.T)
item_s = ratings.T.dot(ratings)
# The diagonal of each Gram matrix holds the squared norm of each rating vector
user_norm = np.array([np.sqrt(np.diagonal(user_s))])
item_norm = np.array([np.sqrt(np.diagonal(item_s))])
user_sim = (user_s/user_norm/user_norm.T)
item_sim = (item_s/item_norm/item_norm.T)
print(np.round(item_sim[:10, :10], 3))
[[ 1. 0.296 0.279 0.388 0.252 0.114 0.518 0.41 0.416 0.199]
[ 0.296 1. 0.177 0.405 0.211 0.099 0.331 0.31 0.207 0.152]
[ 0.279 0.177 1. 0.275 0.118 0.104 0.311 0.125 0.207 0.121]
[ 0.388 0.405 0.275 1. 0.265 0.091 0.411 0.391 0.357 0.219]
[ 0.252 0.211 0.118 0.265 1. 0.016 0.28 0.214 0.202 0.031]
[ 0.114 0.099 0.104 0.091 0.016 1. 0.128 0.065 0.164 0.139]
[ 0.518 0.331 0.311 0.411 0.28 0.128 1. 0.342 0.43 0.279]
[ 0.41 0.31 0.125 0.391 0.214 0.065 0.342 1. 0.364 0.166]
[ 0.416 0.207 0.207 0.357 0.202 0.164 0.43 0.364 1. 0.25 ]
[ 0.199 0.152 0.121 0.219 0.031 0.139 0.279 0.166 0.25 1. ]]
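The matrix above is the cosine similarity between item rating vectors: the dot product of two columns divided by the product of their norms. One detail worth noting is that an item with no training ratings has norm 0, so the plain division yields NaN for its row and column. The sketch below shows one possible guard; the function name cosine_item_similarity and the eps parameter are assumptions, not part of the article's code.

```python
import numpy as np

def cosine_item_similarity(matrix, eps=1e-9):
    """Cosine similarity between the columns (items) of a 0-filled rating matrix."""
    gram = matrix.T.dot(matrix)               # gram[i, j] = dot product of items i and j
    norms = np.sqrt(np.diagonal(gram))        # Euclidean norm of each item column
    # eps keeps the denominator non-zero for items without any rating
    return gram / (np.outer(norms, norms) + eps)

# Hypothetical toy data: item 2 (last column) has no ratings at all
toy = np.array([
    [5.0, 3.0, 0.0],
    [4.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])

sim = cosine_item_similarity(toy)
print(np.round(sim, 3))
```

With the guard, the unrated item gets similarity 0 to everything instead of NaN, so it simply contributes nothing to a weighted average.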
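Rating prediction function
def predict_itemCF(user, item, k=100):
    '''Item-based collaborative filtering: predict a rating (k is unused here)'''
    # Indices of the items this user has rated
    nzero = ratings[user].nonzero()[0]
    # Similarity-weighted average of the user's own ratings
    prediction = ratings[user, nzero].dot(item_sim[item, nzero])\
        / sum(item_sim[item, nzero])
    return prediction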
Test the predictions
test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
test_df.head()
predictions = []
targets = []
print('Test set size: %d' % len(test_df))
print('Predicting with item-based collaborative filtering...')
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_itemCF(user, item))
    targets.append(actual)
print('Test RMSE: %.4f' % rmse(np.array(predictions), np.array(targets)))
Test set size: 20000
Predicting with item-based collaborative filtering...
Test RMSE: 0.9534
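Item-based collaborative filtering combined with the baseline
def predict_itemCF_baseline(user, item):
    '''Item-based CF combined with the baseline: predict a rating'''
    nzero = ratings[user].nonzero()[0]
    # Baseline estimate of this user's rating for every item
    baseline = item_mean + user_mean[user] - all_mean
    # Similarity-weighted average of the deviations from the baseline, added back to the baseline
    prediction = (ratings[user, nzero] - baseline[nzero]).dot(item_sim[item, nzero])\
        / sum(item_sim[item, nzero]) + baseline[item]
    return prediction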
test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
test_df.head()
predictions = []
targets = []
print('Test set size: %d' % len(test_df))
print('Predicting with item-item collaborative filtering combined with the baseline...')
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_itemCF_baseline(user, item))
    targets.append(actual)
print('Test RMSE: %.4f' % rmse(np.array(predictions), np.array(targets)))
Test set size: 20000
Predicting with item-item collaborative filtering combined with the baseline...
Test RMSE: 0.8794
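Read off the code of predict_itemCF_baseline, the predictor can be written as the formula below. The notation is an assumption introduced here for readability: N(u) is the set of items user u rated in training, s_{i,j} is item_sim[i, j], \bar{r}_j and \bar{r}_u are item_mean and user_mean, and \mu is all_mean.

```latex
\hat{r}_{u,i} = b_{u,i}
  + \frac{\sum_{j \in N(u)} s_{i,j}\,\bigl(r_{u,j} - b_{u,j}\bigr)}
         {\sum_{j \in N(u)} s_{i,j}},
\qquad
b_{u,j} = \bar{r}_{j} + \bar{r}_{u} - \mu
```

Plain item-based collaborative filtering averages the raw ratings r_{u,j}; this variant averages only the deviations from the baseline, so systematic user and item effects are handled by b_{u,i} and the similarities only have to explain what is left over.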
Correct out-of-range ratings: set predictions greater than 5 to 5 and predictions less than 1 to 1
def predict_itemCF_baseline(user, item, k=100):
    '''Item-based CF combined with the baseline: predict a rating (k is unused here)'''
    nzero = ratings[user].nonzero()[0]
    baseline = item_mean + user_mean[user] - all_mean
    prediction = (ratings[user, nzero] - baseline[nzero]).dot(item_sim[item, nzero])\
        / sum(item_sim[item, nzero]) + baseline[item]
    # Clamp the prediction to the valid rating range [1, 5]
    if prediction > 5:
        prediction = 5
    if prediction < 1:
        prediction = 1
    return prediction
test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
test_df.head()
predictions = []
targets = []
print('Test set size: %d' % len(test_df))
print('Predicting with item-item collaborative filtering combined with the baseline...')
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_itemCF_baseline(user, item))
    targets.append(actual)
print('Test RMSE after clamping the ratings: %.4f' % rmse(np.array(predictions), np.array(targets)))
Test set size: 20000
Predicting with item-item collaborative filtering combined with the baseline...
Test RMSE after clamping the ratings: 0.8793
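The two if statements in the function above clamp a single prediction to the valid range. As an alternative sketch, np.clip does the same for a scalar or for a whole array of predictions at once; the sample values below are made up for illustration.

```python
import numpy as np

# Hypothetical raw predictions: one below the scale, one inside it, one above it
raw = np.array([0.3, 2.7, 5.8])

# Values below 1 become 1, values above 5 become 5, everything else is unchanged
clipped = np.clip(raw, 1, 5)
print(clipped)
```

Clamping can only move a prediction closer to the true rating, which always lies in [1, 5], so it never increases the RMSE.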
Item-based collaborative filtering (top-K + baseline)
print('------ Top-k collaborative filtering (item-based + baseline) ------')
def predict_topkCF(user, item, k=10):
    '''Top-k CF: item-based collaborative filtering combined with the baseline; predict a rating'''
    nzero = ratings[user].nonzero()[0]
    baseline = item_mean + user_mean[user] - all_mean
    # Keep only the k rated items most similar to the target item
    choice = nzero[item_sim[item, nzero].argsort()[::-1][:k]]
    prediction = (ratings[user, choice] - baseline[choice]).dot(item_sim[item, choice])\
        / sum(item_sim[item, choice]) + baseline[item]
    # Clamp the prediction to the valid rating range [1, 5]
    if prediction > 5: prediction = 5
    if prediction < 1: prediction = 1
    return prediction
print('Loading the test set...')
test_df = pd.read_csv("C://Users/Administrator/Desktop/ml-100k/u3.test", sep='\t', names=title)
test_df.head()
predictions = []
targets = []
print('Test set size: %d' % len(test_df))
print('Predicting with top-K collaborative filtering...')
k = 20
print('Chosen K: %d.' % k)
for row in test_df.itertuples():
    user, item, actual = row[1]-1, row[2]-1, row[3]
    predictions.append(predict_topkCF(user, item, k))
    targets.append(actual)
print('Test RMSE: %.4f' % rmse(np.array(predictions), np.array(targets)))
------ Top-k collaborative filtering (item-based + baseline) ------
Loading the test set...
Test set size: 20000
Predicting with top-K collaborative filtering...
Chosen K: 20.
Test RMSE: 0.7799
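The neighbour selection in predict_topkCF is compact, so here is a small sketch of what the argsort()[::-1][:k] idiom does. The similarity values and item indices are made up for illustration.

```python
import numpy as np

# Hypothetical inputs: indices of the items a user rated, and the similarity
# of the target item to each of them (same order)
rated_items = np.array([3, 8, 10, 42])
sims = np.array([0.2, 0.9, 0.5, 0.7])
k = 2

# argsort() orders positions by ascending similarity, [::-1] reverses that to
# descending, and [:k] keeps the positions of the k largest values
order = sims.argsort()[::-1][:k]
choice = rated_items[order]
print(choice)
```

If the user rated fewer than k items, the slice simply returns all of them, so the function falls back to the full neighbourhood in that case.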