A Survey of Recommender Systems, with Code

By Joey周琦

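Introduction and Notation

In general, a recommender system can be framed as predicting a user's rating of, or click-through rate on, a given item. The problem is described below.

A user's interaction with an item falls into one of three types:

scalar: numerical (e.g. a rating) or ordinal values
binary: two-valued, such as like/dislike, 0 or 1, clicked or not clicked
unary: a single observed action, such as a purchase or online access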

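For clarity, we fix the following notation.

An item is a product and a user is a customer (the terms may be used interchangeably below).

$U$: the set of users
$I$: the set of items
$R$: the set of ratings (or click-through rates, etc.) that users give to items
$S$: the set of possible rating values (e.g. $S=\{1,\dots,5\}$ or $S=\{like,dislike\}$)
$r_{ui}$: the rating user $u$ gives to item $i$
$U_i$: the set of users who have rated item $i$
$I_u$: the set of items rated by user $u$
$I_{uv}=I_u \cap I_v$: the items rated by both $u$ and $v$; $U_{ij}=U_i \cap U_j$: the users who have rated both $i$ and $j$
$N(u)$: the $K$ nearest neighbors (KNN) of user $u$
$N_i(u)$: the $K$ nearest neighbors of user $u$ among the users who have rated item $i$
$N_u(i)$: the $K$ nearest neighbors of item $i$ among the items rated by user $u$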
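To make the set notation concrete, here is a minimal sketch that derives $I_u$, $U_i$, $I_{uv}$ and $U_{ij}$ from a small rating table. The toy data and the helper name `build_index` are made up for illustration and are not part of the original article.

```python
from collections import defaultdict

# Hypothetical toy ratings: (user, item) -> rating r_ui on a 1..5 scale.
ratings = {
    ("alice", "book"): 5,
    ("alice", "film"): 3,
    ("bob", "book"): 4,
    ("bob", "game"): 2,
    ("carol", "film"): 1,
}

def build_index(ratings):
    """Return (I_u, U_i): items rated by each user, users who rated each item."""
    items_of = defaultdict(set)
    users_of = defaultdict(set)
    for (u, i) in ratings:
        items_of[u].add(i)
        users_of[i].add(u)
    return dict(items_of), dict(users_of)

I_u, U_i = build_index(ratings)
I_uv = I_u["alice"] & I_u["bob"]   # items rated by both alice and bob
U_ij = U_i["book"] & U_i["film"]   # users who rated both book and film
```

Storing the sets as Python `set` objects keeps the intersections $I_u \cap I_v$ and $U_i \cap U_j$ a single `&` operation.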

Let a utility function $f$ measure user $u$'s interest in item $i$, i.e. $f: U \times I \to S$. Then for each user $u \in U$ we want to choose the item $i \in I$ that maximizes the user's interest:

$\begin{equation} \forall u \in U, ~~ i_u = \arg \max_{i\in I} f(u,i). \end{equation}$
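The argmax above can be sketched in a few lines. The function name `recommend` and the lookup table `utility` are hypothetical; in a real system $f$ would be a learned model rather than a dictionary.

```python
def recommend(user, items, f):
    """Return i_u = argmax over items of f(user, i); ties go to the first item."""
    return max(items, key=lambda i: f(user, i))

# Hypothetical utility table standing in for a learned f(u, i).
utility = {("u1", "a"): 0.2, ("u1", "b"): 0.9, ("u1", "c"): 0.5}
best = recommend("u1", ["a", "b", "c"], lambda u, i: utility[(u, i)])
```

In practice a system usually returns the top few items rather than a single one, which amounts to sorting by $f(u,i)$ instead of taking the maximum.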

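Each user in the user space $U$ can be represented by a profile, which may include basic attributes (gender, age, region, education), historical behavior (user ID), and so on.
Each item in the item space $I$ can be represented by its content (title, category, etc.) and its historical behavior (item ID).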

Explicit feedback from users takes the following three forms. Such data is relatively scarce but relatively accurate.
* like/dislike
* ratings
* text comments

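Implicit feedback covers behaviors such as saving, clicking, abandoning, printing, and tagging. These do not require the user's deliberate participation. Such data is more plentiful, but less accurate than explicit feedback.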

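Broadly speaking, recommender systems fall into two classes, which can also be combined:

Content-based recommendation (CB): recommends items whose content is similar to that of the items in the user's browsing history.
Collaborative filtering recommendation (CF): recommends items viewed by users similar to the target user (user-CF), or items similar to those the target user has viewed (item-CF). Similarity here does not come from content analysis: two users are more similar the more items they have both viewed, and two items are more similar the more users have viewed both.
Hybrid methods that combine the two approaches above.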
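The co-occurrence notion of similarity used by CF can be sketched as follows. Jaccard overlap is chosen here as one simple way to turn "viewed more of the same items" into a number; it is an assumption of this sketch, not a measure named in the text, and the data and function names are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two sets, scaled to [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical viewing history: user -> set of items seen (I_u).
seen = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "d"},
    "u3": {"e"},
}

def user_cf(target, seen, k=1):
    """Recommend items seen by the k most similar users but not by target."""
    others = [v for v in seen if v != target]
    neighbors = sorted(
        others, key=lambda v: jaccard(seen[target], seen[v]), reverse=True
    )[:k]
    candidates = set()
    for v in neighbors:
        candidates |= seen[v] - seen[target]
    return candidates

recs = user_cf("u1", seen)
```

Item-CF follows the same pattern with the roles swapped: compare the user sets $U_i$ and $U_j$ of two items instead of the item sets of two users.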

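Content-Based Recommender Systems

A content-based (CB) recommender analyzes a user's past ratings, clicks, and other behavior to build a profile or model for each user. The profile is a structured representation of the user's interests and can be used to generate new recommendations. Content-based information filtering needs techniques (natural language processing, information extraction, e.g. TF-IDF) to represent each item and each user profile, plus a strategy for measuring how similar a user profile is to an item.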

The content-based utility function $f(u,i)$ can be defined as:

$\begin{equation} f(u,i) = score(ContentBasedProfile(u),Content(i)) \end{equation}$

$Content(i)$ is the profile of item $i$, and $ContentBasedProfile(u)$ is the profile of user $u$.
Both $Content(i)$ and $ContentBasedProfile(u)$ can be represented as TF-IDF vectors (or by other techniques).
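One common choice for $score$ is cosine similarity between the two vectors. The sketch below assumes that choice, and it also assumes the user profile is the mean of the vectors of items the user engaged with. Neither is specified in the text, and all names and data are hypothetical.

```python
import math

def cosine(p, q):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * q.get(t, 0.0) for t, w in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def build_profile(item_vectors):
    """Assumed profile: the mean of the vectors of items the user engaged with."""
    profile = {}
    for vec in item_vectors:
        for t, w in vec.items():
            profile[t] = profile.get(t, 0.0) + w / len(item_vectors)
    return profile

# Hypothetical item vectors, e.g. TF-IDF weights over a tiny vocabulary.
content = {
    "i1": {"python": 1.0, "code": 1.0},
    "i2": {"python": 1.0, "snake": 1.0},
    "i3": {"cooking": 1.0},
}
profile_u = build_profile([content["i1"]])
scores = {i: cosine(profile_u, vec) for i, vec in content.items()}
```

Here `i2` shares one of two terms with the profile and scores about 0.5, while `i3` shares none and scores 0.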

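Many machine learning methods, such as naive Bayes, neural networks, and decision trees, can be applied to content-based recommendation. (My own understanding: the vector representations serve as features, and clicks or ratings serve as labels, for training a classifier?)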

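The framework of a content-based recommender is shown in the figure below.

[Figure: content-based recommendation framework]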

CB has the following advantages:

User independence: the profile is built solely from the active user's own ratings and history, not from other users' data.
Strong explainability.
CB can recommend items that have no interaction history at all, because it works from content analysis.

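CB also has some limitations:

Limited content analysis: techniques for analyzing audio and image content are limited.
Over-specialization: the system only recommends items whose content is related to the user's history, so recommendations lack novelty.
New users are hard to serve, because they have no history.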

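Item Representation

In most CB systems, item descriptions are textual features extracted from web pages, articles, reviews, and similar content. Because natural language is ambiguous, textual features raise some complications:

Polysemy: one word has multiple meanings.
Synonymy: multiple words share the same meaning.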

Keyword-Based Vector Space Model

Most content-based recommender systems use relatively simple retrieval models, such as keyword matching or the Vector Space Model (VSM) with basic TF-IDF weighting. Let $D=\{d_1,d_2,\cdots,d_N\}$ denote a set of documents (the corpus), and let $T=\{t_1,t_2,\cdots,t_n\}$ be the dictionary, that is, the set of words in the corpus. $T$ is obtained by applying standard natural language processing operations such as tokenization, stopword removal, and stemming. Each document $d_j$ is represented as a vector in an $n$-dimensional vector space, $d_j=\{w_{1j},w_{2j},\cdots,w_{nj}\}$, where $w_{kj}$ is the weight of term $t_k$ in document $d_j$. TF-IDF weighting is based on the following assumptions:

rare terms are not less relevant than frequent terms (IDF assumption);
multiple occurrences of a term in a document are not less relevant than single occurrences (TF assumption);
long documents are not preferred to short documents (normalization assumption).

TF-IDF is calculated as follows:

$\begin{equation} TF\text{-}IDF(t_k,d_j) = TF(t_k,d_j) \cdot \log \frac{N}{n_k} \end{equation}$

Here $N$ is the number of documents in the corpus and $n_k$ is the number of documents in which term $t_k$ occurs at least once. The second factor is the IDF, and the first factor, TF, can be calculated as follows:

$\begin{equation} TF(t_k,d_j) = \frac {f_{k,j}} {\max_z f_{z,j}} \end{equation}$

where $f_{k,j}$ is the number of times term $t_k$ occurs in document $d_j$, and the maximum is taken over all terms $t_z$ that occur in $d_j$. The weights $w_{kj}$ are then obtained by applying cosine normalization to these values.
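For reference, cosine normalization in its standard form divides each TF-IDF value by the Euclidean norm of the document's TF-IDF vector, i.e. $w_{kj} = TF\text{-}IDF(t_k,d_j) / \sqrt{\sum_{s=1}^{|T|} TF\text{-}IDF(t_s,d_j)^2}$. The sketch below puts the TF, IDF, and normalization steps together. It assumes pre-tokenized documents and the natural logarithm; the function name `tf_idf_vectors` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # n_k: number of documents in which term t_k occurs at least once.
    doc_freq = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)                         # f_{k,j}
        max_count = max(counts.values(), default=1)   # max_z f_{z,j}
        raw = {
            t: (c / max_count) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(v * v for v in raw.values()))
        # Cosine normalization; an all-zero vector is left as zeros.
        vectors.append({t: (v / norm if norm else 0.0) for t, v in raw.items()})
    return vectors

# Hypothetical toy corpus, already tokenized.
docs = [
    ["apple", "apple", "pie"],
    ["apple", "juice"],
    ["pie", "crust", "crust"],
]
weights = tf_idf_vectors(docs)
```

After normalization every non-zero document vector has unit length, so long documents gain no advantage over short ones, and the choice of logarithm base no longer affects the weights. A term that occurs in every document gets IDF $\log 1 = 0$ and therefore weight 0.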
