
k-fold cross-validation and Top-N recommendation with SVD matrix factorization

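If you have not read the previous section on using the Surprise project, please read it first. This article uses Surprise, an open-source project on GitHub.

The previous section covered installation and the attributes of some methods. This section walks through a concrete code demo, using SVD as the example.

First, how to load a local dataset with Surprise and run k-fold cross-validation. A look at the API shows this is very simple, which reflects how capable Surprise is. The code is as follows: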

# -*- coding: utf-8 -*-
"""
Created on Mon Aug  7 13:09:08 2017

@author: Jipon
"""

from surprise import SVD
from surprise import Dataset
from surprise import evaluate, print_perf
from surprise import dataset


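# Load a local dataset and run 3-fold cross-validation.
# Note: data.split(), evaluate() and print_perf() are the API of the Surprise
# release used here; newer releases replace them with
# surprise.model_selection.cross_validate.

# Each line of the file has the form "user item rating", separated by spaces.
reader = dataset.Reader(line_format='user item rating', sep=' ')
data = Dataset.load_from_file('C:\\Users\\Jipon\\Desktop\\surprise\\train.txt', reader)
# Use 3 folds; without this call the default is 5 folds.
data.split(n_folds=3)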

# We'll use the famous SVD algorithm.
algo = SVD()

# Evaluate performances of our algorithm on the dataset.
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])

print_perf(perf)

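The code above shows how to load a local dataset and evaluate SVD recommendations on it (links to the dataset and code are at the end of this article). The evaluation measures are RMSE and MAE. The measures in the official documentation do not include precision and recall; if we need those two, we can define them ourselves. See the official site for details.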
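To make the idea of k-fold cross-validation with RMSE and MAE concrete, here is a minimal sketch in plain Python that does not use Surprise at all. The helper names (`k_fold_indices`, `rmse`, `mae`, `cross_validate_mean_baseline`) and the toy ratings are hypothetical, and the "model" is just the global mean rating of the training fold, standing in for SVD. It is meant only to show what happens in each fold, not how Surprise implements it.

```python
import math
import random


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) lists; each sample is tested exactly once."""
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need 2 <= n_folds <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for fold in range(n_folds):
        test_idx = indices[fold::n_folds]  # every n_folds-th shuffled index
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimate) pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)


def cross_validate_mean_baseline(ratings, n_folds=3, seed=0):
    """Return one (rmse, mae) tuple per fold for a global-mean predictor.

    ratings is a list of (user, item, rating) tuples. The predictor is an
    assumption made for this sketch; it is not the SVD algorithm.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(ratings), n_folds, seed):
        global_mean = sum(ratings[i][2] for i in train_idx) / len(train_idx)
        pairs = [(ratings[i][2], global_mean) for i in test_idx]
        scores.append((rmse(pairs), mae(pairs)))
    return scores


# Toy data in the same "user item rating" shape as train.txt (values made up).
ratings = [
    ("u1", "i1", 4.0), ("u1", "i2", 3.0), ("u2", "i1", 5.0),
    ("u2", "i3", 2.0), ("u3", "i2", 4.0), ("u3", "i3", 1.0),
]

for fold, (r, m) in enumerate(cross_validate_mean_baseline(ratings), start=1):
    print(f"fold {fold}: RMSE={r:.4f} MAE={m:.4f}")
```

Each rating lands in the test set of exactly one fold, so every rating is used for both training and evaluation across the run.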


When building a recommender system we often make Top-N recommendations. The code is as follows:

# -*- coding: utf-8 -*-
"""
Created on Tue Aug  8 13:27:08 2017

@author: Jipon
"""

from collections import defaultdict

from surprise import SVD
from surprise import Dataset
from surprise import dataset

def get_top_n(predictions, n=10):
    '''Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendation to output for each user. Default
            is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    '''

    # First map the predictions to each user. defaultdict(list) gives every
    # new key an empty list by default.
    top_n = defaultdict(list)

    # uid is the user id, iid is the item id, true_r is the true rating,
    # est is the rating estimated by the factorization.
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


# Load the dataset and train SVD on all of it.
# Note: algo.train() is the method name in the Surprise release used here;
# newer releases name it algo.fit().

reader = dataset.Reader(line_format='user item rating', sep=' ')
data = Dataset.load_from_file('C:\\Users\\Jipon\\Desktop\\surprise\\train.txt', reader)
trainset = data.build_full_trainset()
algo = SVD()
algo.train(trainset)

# Recommend the Top-N items that are not in the training set.
# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=2)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])

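The script prints one line per user: the user id followed by the raw ids of that user's top 2 recommended items.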


You can then use the recommended Top-N items to compute precision and recall. Simple, isn't it?
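As a rough illustration of that last step, the sketch below computes per-user precision and recall from a Top-N list in plain Python. Everything in it is an assumption made for the example: the function names `top_n_items` and `precision_recall`, the toy predictions, and the definition of "relevant" (a set of held-out items each user is known to like, which you would have to build from a separate test set, since the anti-testset above contains no true ratings). The prediction tuples only mimic the five fields unpacked in `get_top_n`.

```python
from collections import defaultdict


def top_n_items(predictions, n=10):
    """Map each user id to the ids of its n highest-estimated items.

    predictions is an iterable of (uid, iid, true_r, est, details) tuples,
    the same five fields that get_top_n unpacks.
    """
    per_user = defaultdict(list)
    for uid, iid, _true_r, est, _details in predictions:
        per_user[uid].append((iid, est))
    return {
        uid: [iid for iid, _ in
              sorted(items, key=lambda x: x[1], reverse=True)[:n]]
        for uid, items in per_user.items()
    }


def precision_recall(recommended, relevant):
    """Return (precisions, recalls), two dicts keyed by user id.

    recommended: dict uid -> list of recommended item ids.
    relevant: dict uid -> set of held-out item ids the user liked
              (hypothetical ground truth for this sketch).
    """
    precisions, recalls = {}, {}
    for uid, items in recommended.items():
        liked = relevant.get(uid, set())
        hits = sum(1 for iid in items if iid in liked)
        # precision: share of recommended items that are relevant
        precisions[uid] = hits / len(items) if items else 0.0
        # recall: share of relevant items that were recommended
        recalls[uid] = hits / len(liked) if liked else 0.0
    return precisions, recalls


# Made-up predictions and ground truth, for illustration only.
predictions = [
    ("u1", "a", None, 4.5, None), ("u1", "b", None, 3.0, None),
    ("u1", "c", None, 4.0, None), ("u2", "a", None, 2.0, None),
    ("u2", "d", None, 4.8, None),
]
relevant = {"u1": {"a", "b"}, "u2": {"d"}}

recommended = top_n_items(predictions, n=2)
precisions, recalls = precision_recall(recommended, relevant)
for uid in recommended:
    print(uid, recommended[uid], precisions[uid], recalls[uid])
```

Averaging the per-user values gives a single precision and recall figure for the whole system.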

Link to the code and dataset above:

https://github.com/Jipon/SVDTest