DataCastle [猜你喜歡] ("Guess What You Like") Recommendation Competition: Kuhung's Approach and Code
Overview
I am kuhung, a participant in the DataCastle [猜你喜歡] recommendation competition. On the evaluation leaderboard at the competition deadline, my team, 猜你不喜歡 ("Guess What You Don't Like"), finished second with a final score of 7.86565. Below I share my takeaways from the competition along with my code.
Thoughts on the Task
In this competition, DataCastle provided roughly 160,000 users' ratings of items on an e-commerce site, collected over four years. Every rating record carries a timestamp (the actual times are masked; only their ordering is preserved). Ratings use a 5-level scale, with 1 the lowest and 5 the highest.
Over 34 million rating records were extracted as the training set, stored in train.csv with the field format:
uid,iid,score,time
user i, item a, rating, relative time
user j, item b, rating, relative time
Notes:
1) The first line is the header, formatted as uid,iid,score,time; the four fields are user ID, item ID, rating, and relative time;
2) each following line is one user's rating of one item, with lines separated by newlines;
3) fields within a line are separated by commas.
Nearly 550,000 rating records were extracted as the test set. The users' ratings were hidden, keeping only which user rated which item; the data file is test.csv with the field format:
uid,iid
user k, item c
user l, item d
The remaining notes are the same as for the training set (train.csv).
Recommender systems are now ubiquitous on the web. Because they bring sites additional revenue, they have attracted a great deal of attention in recent years, and the basic algorithmic theory is fairly mature. This competition's dataset is clean and well structured, making it a good opportunity for newcomers to get acquainted with recommender systems.
Models Used
Main model: Bayesian Probabilistic Matrix Factorization (BPMF)
Blended model: a Keras neural network (code shared by yinjh)
BPMF
- PMF
BPMF is a probabilistic matrix factorization model built on top of PMF. Readers unfamiliar with PMF can refer to this article: 概率矩陣分解模型 PMF
In that article the author explains that PMF can be seen as an extension of singular value decomposition (SVD), with SVD a special case of PMF. SVD requires the original rating matrix to be filled in before it can be factorized. PMF instead embeds users and items in an F-dimensional latent feature space, placing Gaussian priors on the user and item latent vectors and a Gaussian conditional distribution on the observed ratings.
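Concretely, writing U_i for a user's latent vector (w1_P1 in the code below) and V_j for an item's (w1_M1), the MAP objective that pmf.m minimizes by stochastic gradient descent over the observed pairs \Omega can be written in LaTeX notation as:

E = \sum_{(i,j) \in \Omega} \Big[ \big( r_{ij} - U_i^{\top} V_j \big)^2 + \tfrac{\lambda}{2} \big( \lVert U_i \rVert^2 + \lVert V_j \rVert^2 \big) \Big]

where the ratings r_{ij} are first centered by subtracting the training mean.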
With this approach, PMF scored 7.8331 on the DC dataset, which placed fifth at that stage.
- BPMF
BPMF improves on PMF by solving the problem from a Bayesian perspective rather than with a single point estimate. In the original paper's experiments on the Netflix dataset, its accuracy improved on PMF by 1.7% to 3.6%.
For a detailed introduction to BPMF, see this blog post: 【推薦系統演算法】BPMF
With BPMF, the evaluation score was 7.85053, a 0.22% improvement over PMF on the same dataset.
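The practical source of that gain is prediction averaging: rather than keeping one MAP point estimate, bayespmf.m below draws samples (U^{(k)}, V^{(k)}) by Gibbs sampling and maintains a running mean of their predictions (probe_rat_all), approximating the Bayesian predictive mean (up to the per-sample clipping to [1,5]):

\hat{r}_{ij} \approx \bar{r} + \frac{1}{K} \sum_{k=1}^{K} {U_i^{(k)}}^{\top} V_j^{(k)}

where \bar{r} is the training mean rating and K is the number of sampling epochs.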
- Keras neural network
The Keras network uses the code shared by yinjh, with only minor changes made to run on my machine.
On my machine this Keras approach scored 7.83312, slightly better than PMF.
Code
The BPMF code here is implemented in MATLAB; unless stated otherwise, all code below runs in MATLAB. Code reference: Bayesian Probabilistic Matrix Factorization
Data Preprocessing
% train.csv has the time column removed beforehand (easily done in Python).
% csvread(file,1,1) skips the header row and the first column; the column
% offset assumes the preprocessing wrote a leading row-index column (as
% pandas to_csv does by default). If the file starts directly with uid,
% read from column 0 instead: csvread('train.csv',1,0).
M=csvread('train.csv',1,1); % first row holds column names; numeric values start at row 2
M(:,1)=M(:,1)+1; % MATLAB indices start at 1; the IDs start at 0, so shift by one.
M(:,2)=M(:,2)+1;
save train M
N=csvread('test.csv',1,1);
N(:,1)=N(:,1)+1;
N(:,2)=N(:,2)+1;
save test N
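The scripts below hard-code the user and item counts (num_p, num_m). A minimal sketch of how to derive them once the shifted matrices are in the workspace, assuming the raw IDs are contiguous from 0 (which should reproduce the hard-coded values):

load train
load test
num_p = max([M(:,1); N(:,1)]) % number of users (223970 on this data)
num_m = max([M(:,2); N(:,2)]) % number of items (14726 on this data)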
Original Notice
% Code provided by Ruslan Salakhutdinov
%
% Permission is granted for anyone to copy, use, modify, or distribute this
% program and accompanying programs and documents for any purpose, provided
% this copyright notice is retained and prominently displayed, along with
% a note saying that the original programs are available from our
% web page.
% The programs and documents are distributed without any warranty, express or
% implied. As the programs were written for research purposes only, they have
% not been tested to the degree that would be advisable in any important
% application. All use of these programs is entirely at the user's own risk.
pmf.m
rand('state',0);
randn('state',0);
if restart==1
restart=0;
epsilon=50; % Learning rate
lambda = 0.01; % Regularization parameter
momentum=0.8;
epoch=1;
maxepoch=50; % Maximum number of training epochs
load train % Triplets: {user_id, item_id, rating}
mean_rating = mean( M (:,3)); % mean training rating
pairs_tr = length(M); % number of training pairs
%%%%%%%%%%% The item and user counts must be known in advance (see the preprocessing section above) %%%%%%%%%%%%
numbatches= 210; % Number of batches
num_m = 14726; % Number of items
num_p = 223970; % Number of users
num_feat = 10; % Rank-10 decomposition (number of latent features)
w1_M1 = 0.1*randn(num_m, num_feat); % Item feature vectors
w1_P1 = 0.1*randn(num_p, num_feat); % User feature vectors
w1_M1_inc = zeros(num_m, num_feat); % initialize gradient increments
w1_P1_inc = zeros(num_p, num_feat); % initialize gradient increments
end
for epoch = epoch:maxepoch % main training loop
rr = randperm(pairs_tr); % random permutation of the training pairs
M = M(rr,:); % shuffle the training set
clear rr
for batch = 1:numbatches
fprintf(1,'epoch %d batch %d \r',epoch,batch);
N=157987; % number of training triplets per batch
aa_p = double(M((batch-1)*N+1:batch*N,1));
aa_m = double(M((batch-1)*N+1:batch*N,2));
rating = double(M((batch-1)*N+1:batch*N,3));
rating = rating-mean_rating; % Default prediction is the mean rating.
%%%%%%%%%%%%%% Compute Predictions %%%%%%%%%%%%%%%%%
pred_out = sum(w1_M1(aa_m,:).*w1_P1(aa_p,:),2); % sum along dim 2: one dot product per (user,item) pair
f = sum( (pred_out - rating).^2 + ...
0.5*lambda*( sum( (w1_M1(aa_m,:).^2 + w1_P1(aa_p,:).^2),2)));
%%%%%%%%%%%%%% Compute Gradients %%%%%%%%%%%%%%%%%%%
IO = repmat(2*(pred_out - rating),1,num_feat); % replicate the error term across the feature columns
Ix_m=IO.*w1_P1(aa_p,:) + lambda*w1_M1(aa_m,:);
Ix_p=IO.*w1_M1(aa_m,:) + lambda*w1_P1(aa_p,:);
dw1_M1 = zeros(num_m,num_feat);
dw1_P1 = zeros(num_p,num_feat);
for ii=1:N
dw1_M1(aa_m(ii),:) = dw1_M1(aa_m(ii),:) + Ix_m(ii,:);
dw1_P1(aa_p(ii),:) = dw1_P1(aa_p(ii),:) + Ix_p(ii,:);
end
%%%% Update item and user features %%%%%%%%%%%
w1_M1_inc = momentum*w1_M1_inc + epsilon*dw1_M1/N;
w1_M1 = w1_M1 - w1_M1_inc;
w1_P1_inc = momentum*w1_P1_inc + epsilon*dw1_P1/N;
w1_P1 = w1_P1 - w1_P1_inc;
end
%%%%%%%%%%%%%% Compute Predictions after Parameter Updates %%%%%%%%%%%%%%%%%
pred_out = sum(w1_M1(aa_m,:).*w1_P1(aa_p,:),2);
f_s = sum( (pred_out - rating).^2 + ...
0.5*lambda*( sum( (w1_M1(aa_m,:).^2 + w1_P1(aa_p,:).^2),2)));
err_train(epoch) = sqrt(f_s/N);
%%% Compute predictions on the test set %%%%%%%%%%%%%%%%%%%%%%
load test
pairs_pr = length(N); % the test pairs were saved as variable N
NN=pairs_pr;
aa_p = double(N(:,1));
aa_m = double(N(:,2));
%rating = double(N(:,3));
pred_out = sum(w1_M1(aa_m,:).*w1_P1(aa_p,:),2) + mean_rating;
ff = find(pred_out>5); pred_out(ff)=5; % Clip predictions to the valid rating range
ff = find(pred_out<1); pred_out(ff)=1;
fprintf(1, 'epoch %4i batch %4i Training RMSE %6.4f \n', ...
epoch, batch, err_train(epoch));
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if (rem(epoch,10))==0
save pmf_weight w1_M1 w1_P1
end
end
% The result is left in pred_out in the workspace; copy it out, or write it to disk with: save pre_pmf.csv pred_out -ascii
bayespmf.m
Before running bpmf, assign the training set's mean rating to the test set to create a score column, as in the sketch below.
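A minimal sketch of that step, assuming M and N are the matrices saved during preprocessing:

load train
load test
N(:,3) = mean(M(:,3)); % placeholder score: every test pair gets the training mean rating
save test N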
There are also two helper functions.
makematrix.m
load train
num_m = 14726;
num_p = 223970;
count = sparse(num_p,num_m); % for Netflix-scale data, use a sparse matrix instead.
for mm=1:num_m
ff= find(M(:,2)==mm);
fprintf(1, '\n %d / %d \t \n', mm,num_m);
count(M(ff,1),mm) = M(ff,3);
end
save makematrix count % save the result for reuse
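The per-item loop above is straightforward but slow at this scale. A hedged alternative is MATLAB's triplet form of sparse(), assuming each (user, item) pair occurs at most once in the training set; note that sparse() sums duplicate entries, while the loop overwrites them:

load train
num_m = 14726;
num_p = 223970;
count = sparse(M(:,1), M(:,2), M(:,3), num_p, num_m); % users x items rating matrix
save makematrix count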
pred.m
function [pred_out] = pred(w1_M1_sample,w1_P1_sample,N,mean_rating);
%%% Make predictions on the validation data
aa_p = double(N(:,1));
aa_m = double(N(:,2));
rating = double(N(:,3));
pred_out = sum(w1_M1_sample(aa_m,:).*w1_P1_sample(aa_p,:),2) + mean_rating;
ff = find(pred_out>5); pred_out(ff)=5;
ff = find(pred_out<1); pred_out(ff)=1;
bayespmf.m main body
rand('state',0);
randn('state',0);
if restart==1
restart=0;
epoch=1;
maxepoch=50;
iter=0;
num_m = 14726;
num_p = 223970;
num_feat = 10;
% Initialize hierarchical priors
beta=2; % observation noise (precision)
mu_u = zeros(num_feat,1);
mu_m = zeros(num_feat,1);
alpha_u = eye(num_feat); % identity matrix as the initial precision
alpha_m = eye(num_feat);
% parameters of the Inv-Wishart distribution (see paper for details)
WI_u = eye(num_feat);
b0_u = 2;
df_u = num_feat;
mu0_u = zeros(num_feat,1);
WI_m = eye(num_feat);
b0_m = 2;
df_m = num_feat;
mu0_m = zeros(num_feat,1);
load train
load test
mean_rating = mean(M(:,3));
ratings_test = double(N(:,3));
pairs_tr = length(M);
pairs_pr = length(N);
fprintf(1,'Initializing Bayesian PMF using MAP solution found by PMF \n');
load makematrix
load pmf_weight
err_test = cell(maxepoch,1);
w1_P1_sample = w1_P1;
w1_M1_sample = w1_M1;
clear w1_P1 w1_M1;
% Initialization using MAP solution found by PMF.
%% Do simple fit
mu_u = mean(w1_P1_sample)';
d=num_feat;
alpha_u = inv(cov(w1_P1_sample));
mu_m = mean(w1_M1_sample)';
alpha_m = inv(cov(w1_M1_sample)); % item precision comes from the item features
count=count'; % transpose so count enters the Gibbs loop as num_m x num_p
probe_rat_all = pred(w1_M1_sample,w1_P1_sample,N,mean_rating);
counter_prob=1;
end
for epoch = epoch:maxepoch
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Sample from movie hyperparams (see paper for details)
n = size(w1_M1_sample,1);
x_bar = mean(w1_M1_sample)';
S_bar = cov(w1_M1_sample);
WI_post = inv(inv(WI_m) + n/1*S_bar + ...
n*b0_m*(mu0_m - x_bar)*(mu0_m - x_bar)'/(1*(b0_m+n)));
WI_post = (WI_post + WI_post')/2;
df_mpost = df_m+n;
alpha_m = wishrnd(WI_post,df_mpost);
mu_temp = (b0_m*mu0_m + n*x_bar)/(b0_m+n);
lam = chol( inv((b0_m+n)*alpha_m) ); lam=lam';
mu_m = lam*randn(num_feat,1)+mu_temp;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%% Sample from user hyperparams
n = size(w1_P1_sample,1);
x_bar = mean(w1_P1_sample)';
S_bar = cov(w1_P1_sample);
WI_post = inv(inv(WI_u) + n/1*S_bar + ...
n*b0_u*(mu0_u - x_bar)*(mu0_u - x_bar)'/(1*(b0_u+n)));
WI_post = (WI_post + WI_post')/2;
df_mpost = df_u+n;
alpha_u = wishrnd(WI_post,df_mpost);
mu_temp = (b0_u*mu0_u + n*x_bar)/(b0_u+n);
lam = chol( inv((b0_u+n)*alpha_u) ); lam=lam';
mu_u = lam*randn(num_feat,1)+mu_temp;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Start doing Gibbs updates over user and
% movie feature vectors given hyperparams.
for gibbs=1:2
fprintf(1,'\t\t Gibbs sampling %d \r', gibbs);
%%% Infer posterior distribution over all movie feature vectors
count=count'; % now num_p x num_m: column mm lists the users who rated item mm
for mm=1:num_m
%fprintf(1,'movie =%d\r',mm);
ff = find(count(:,mm)>0);
MM = w1_P1_sample(ff,:);
rr = count(ff,mm)-mean_rating;
covar = inv((alpha_m+beta*MM'*MM));
mean_m = covar * (beta*MM'*rr+alpha_m*mu_m);
lam = chol(covar); lam=lam';
w1_M1_sample(mm,:) = lam*randn(num_feat,1)+mean_m;
end
%%% Infer posterior distribution over all user feature vectors
count=count'; % back to num_m x num_p: column uu lists the items rated by user uu
for uu=1:num_p
%fprintf(1,'user =%d\r',uu);
ff = find(count(:,uu)>0);
MM = w1_M1_sample(ff,:);
rr = count(ff,uu)-mean_rating;
covar = inv((alpha_u+beta*MM'*MM));
mean_u = covar * (beta*MM'*rr+alpha_u*mu_u);
lam = chol(covar); lam=lam';
w1_P1_sample(uu,:) = lam*randn(num_feat,1)+mean_u;
end
end
probe_rat = pred(w1_M1_sample,w1_P1_sample,N,mean_rating);
probe_rat_all = (counter_prob*probe_rat_all + probe_rat)/(counter_prob+1);
counter_prob=counter_prob+1;
fprintf(1, '\nEpoch %d \t \n', epoch);
end
% The result is left in probe_rat_all in the workspace; copy it out, or write it with: save pre_bpmf.csv probe_rat_all -ascii
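For reference, the covar and mean_u lines above implement the standard BPMF conditional posterior for a user's feature vector (the item update is symmetric). In LaTeX notation, with \Lambda_U the sampled user precision (alpha_u), \mu_U the sampled user mean (mu_u), \bar{r} the mean rating, and \Omega_i the set of items user i rated:

U_i \mid R, V \sim \mathcal{N}\big( \mu_i^{*}, (\Lambda_i^{*})^{-1} \big), \quad
\Lambda_i^{*} = \Lambda_U + \beta \sum_{j \in \Omega_i} V_j V_j^{\top}, \quad
\mu_i^{*} = (\Lambda_i^{*})^{-1} \Big( \beta \sum_{j \in \Omega_i} V_j (r_{ij} - \bar{r}) + \Lambda_U \mu_U \Big)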
Driver Script
restart=1;
fprintf(1,'Running Probabilistic Matrix Factorization (PMF) \n');
pmf
restart=1;
fprintf(1,'\nRunning Bayesian PMF\n');
bayespmf
Competition Wrap-up
This competition was a different kind of attempt for me. I resisted the urge to copy existing code from the start, looked for another model instead, and kept debugging and revising it, learning a great deal along the way. Averaging my results with those from yinjh's code produced a solid final score.
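The blend itself was a plain unweighted average of the two prediction vectors; a minimal sketch, with hypothetical file names:

% pre_bpmf.csv and pre_keras.csv are assumed to each hold one prediction per test row
bpmf_pred  = load('pre_bpmf.csv');
keras_pred = load('pre_keras.csv');
final_pred = (bpmf_pred + keras_pred) / 2; % unweighted model average
save final_pred.csv final_pred -ascii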
After reading the code shared by yes,boy!, I realized there is still plenty in it worth learning from. Learning never ends; keep moving forward.
A community manager asked me what it feels like to top the DC points leaderboard. Honestly, it feels hollow, because whether it is generous sharers like yinjh, computer-science specialists like yes,boy!, or occasionally surfacing experts like 雲泛天音, they are all far stronger than I am. These two competitions were only first attempts; there is still a great deal to learn and many problems to solve.
Let me close with a line from Legend of the Galactic Heroes: our journey is the sea of stars.