Recommendation Algorithms: DeepFM, Tested with DeepCTR
Published: 2020-10-16
## Algorithm Overview
A deep network on the left and an FM on the right, hence the name DeepFM.
![DeepFm](https://gitee.com/jadepeng/pic/raw/master/pic/2020/10/15/1602764397990.png)
It consists of two parts:
- Part 1: the FM (Factorization Machines) component
![FM](https://gitee.com/jadepeng/pic/raw/master/pic/2020/10/15/1602764448337.png)
On top of the traditional first-order linear model, FM adds a second-order term that captures pairwise feature interactions (written out in LaTeX after this list).
![特徵相互關係](https://gitee.com/jadepeng/pic/raw/master/pic/2020/10/15/1602764632765.png)
This formula can be simplified to reduce the computational cost; the figure below is taken from the web.
![FM](https://gitee.com/jadepeng/pic/raw/master/pic/2020/10/15/1602765047458.png)
- Part 2: the deep component
The deep component is a multi-layer DNN.
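For reference, the FM equation in the figures above, together with the standard simplification that reduces the pairwise term from O(kn²) to O(kn), can be written as follows (standard FM algebra, not specific to this post):

``` latex
\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i
           + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j

% the pairwise term rewritten as "square of sum minus sum of squares":
\sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j
  = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f}\, x_i \right)^{2}
  - \sum_{i=1}^{n} v_{i,f}^{2}\, x_i^{2} \right]
```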
## Algorithm Implementation
For the implementation, [用Keras實現一個DeepFM](https://blog.csdn.net/songbinxu/article/details/80151814) and [·清塵·《FM、FMM、DeepFM整理(pytorch)》](https://blog.csdn.net/u012969412/article/details/88684723) explain things clearly; the Keras implementation is referenced below for illustration.
The overall network structure:
![網路結構](https://gitee.com/jadepeng/pic/raw/master/pic/2020/10/15/1602764969628.png)
### Feature Encoding
Features fall into three categories:
- Continuous fields, e.g. numeric features
- Single-value categorical fields, e.g. gender, which takes one of male or female
- Multi-value categorical fields, e.g. tags, where one sample can carry several values
Continuous fields can be concatenated directly into dense data.
After one-hot encoding the single-value and multi-value fields, a single-value categorical field yields a one-hot vector with exactly one position set to 1, while a multi-value categorical field yields a vector with more than one position set to 1, meaning the field takes several feature values at once:
| label | shop_score | gender=m | gender=f | interest=f | interest=c |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.2 | 1 | 0 | 1 | 1 |
| 1 | 0.8 | 0 | 1 | 0 | 1 |
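A minimal sketch (my addition, with hypothetical column values) of how a table like the one above can be produced with pandas and scikit-learn:

``` python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({
    "label": [0, 1],
    "shop_score": [0.2, 0.8],          # continuous field, kept as dense data
    "gender": ["m", "f"],              # single-value categorical field
    "interest": [["f", "c"], ["c"]],   # multi-value categorical field
})

one_hot = pd.get_dummies(df["gender"], prefix="gender")  # exactly one 1 per row

mlb = MultiLabelBinarizer()
multi_hot = pd.DataFrame(mlb.fit_transform(df["interest"]),
                         columns=[f"interest={c}" for c in mlb.classes_])

# reproduces the structure of the table above (column naming aside)
encoded = pd.concat([df[["label", "shop_score"]], one_hot, multi_hot], axis=1)
print(encoded)
```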
### FM Component
![FM](https://gitee.com/jadepeng/pic/raw/master/pic/2020/10/15/1602765456508.png)
Consider the formula:
![FM](https://gitee.com/jadepeng/pic/raw/master/pic/2020/10/15/1602764448337.png)
First compute the FM first-order term:
- Continuous fields can be handled with a Dense(1) layer
- Single-value categorical fields use Embedding(n, 1), where n is the number of categories
- Multi-value categorical fields can take several values at once, so for batch training the samples must be zero-padded to the same length. They can likewise be handled with an Embedding; since each sample yields several embedded weights, take their average (see the sketch after the figure below).
![1次項](https://gitee.com/jadepeng/pic/raw/master/pic/2020/10/15/1602765746903.png)
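A minimal Keras sketch (my addition; the field names and vocabulary sizes are hypothetical) of the three first-order branches just described. Note that averaging over the padded positions includes the zeros, which is a simplification; a masked mean would exclude them:

``` python
from tensorflow.keras import layers

max_len, n_gender, n_interest = 3, 3, 10  # hypothetical padding/vocabulary sizes

dense_in = layers.Input(shape=(1,), name="shop_score")                         # continuous field
gender_in = layers.Input(shape=(1,), dtype="int32", name="gender")             # single-value, label-encoded
interest_in = layers.Input(shape=(max_len,), dtype="int32", name="interest")   # multi-value, zero-padded

dense_1st = layers.Dense(1, use_bias=False)(dense_in)                  # w * x
gender_1st = layers.Reshape((1,))(layers.Embedding(n_gender, 1)(gender_in))
interest_1st = layers.GlobalAveragePooling1D()(                        # mean over the padded values
    layers.Embedding(n_interest, 1)(interest_in))

first_order = layers.Add()([dense_1st, gender_1st, interest_1st])
```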
Next compute the FM second-order term; this part takes a bit more effort to understand.
[·清塵·《FM、FMM、DeepFM整理(pytorch)》](https://blog.csdn.net/u012969412/article/details/88684723) walks through the process clearly; see that post for the derivation.
On the implementation side, the explanation in [DeepFM模型CTR預估理論與實戰](http://fancyerii.github.io/2019/12/19/deepfm/#fm-layer) is easier to follow.
![FM公式](https://gitee.com/jadepeng/pic/raw/master/pic/2020/10/16/1602819038585.png)
Suppose there are only the two categorical features C1 and C2 from that post's example, with vocabulary sizes 3 and 2 respectively. Suppose the input is again C1=2, C2=2 (indices starting from 1); after the embedding lookup we get V2=[e21,e22,e23,e24] and V5=[e51,e52,e53,e54].
A pair needs to be computed only when xi and xj are both nonzero, so the only case in the formula above that matters is i=2, j=5. Therefore:
![FM](https://gitee.com/jadepeng/pic/raw/master/pic/2020/10/16/1602819112460.png)
Extending to more fields, say C1, C2, and C3, we need the pairwise inner products:
![](https://gitee.com/jadepeng/pic/raw/master/pic/2020/10/16/1602819135487.png)
How can all of these be computed in one shot with matrix operations? Consider the following:
![](https://gitee.com/jadepeng/pic/raw/master/pic/2020/10/16/1602819176970.png)
The corresponding code (from DeepCTR's FM layer; reduce_sum is DeepCTR's TF1/TF2 compatibility wrapper) is:
``` python
# "square of sum": sum the field embeddings, then square elementwise
square_of_sum = tf.square(reduce_sum(
    concated_embeds_value, axis=1, keep_dims=True))
# "sum of squares": square each field embedding elementwise, then sum
sum_of_square = reduce_sum(
    concated_embeds_value * concated_embeds_value, axis=1, keep_dims=True)
cross_term = square_of_sum - sum_of_square
# halve and sum over the embedding dimension to get the scalar second-order term
cross_term = 0.5 * reduce_sum(cross_term, axis=2, keep_dims=False)
```
where concated_embeds_value is the concatenation of the field embeddings, with shape (batch_size, num_fields, embedding_dim).
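A quick numpy check (my addition, with arbitrary sizes) that the "square of sum minus sum of squares" trick really equals the pairwise inner-product sum:

``` python
import numpy as np

V = np.random.rand(3, 4)  # one sample: 3 field embeddings of dimension 4

# direct pairwise sum of inner products, i < j
pairwise = sum(V[i] @ V[j] for i in range(3) for j in range(i + 1, 3))
# the FM trick: 0.5 * ((sum of vectors)^2 - sum of squared vectors), summed over dims
trick = 0.5 * (np.square(V.sum(axis=0)) - np.square(V).sum(axis=0)).sum()

assert np.allclose(pairwise, trick)  # the two computations agree
```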
### Deep Component
The DNN part is straightforward; in DeepCTR, the FM and the DNN take the same input, group_embedding_dict, so the two components share the field embeddings. A sketch follows.
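A minimal sketch (my addition, not DeepCTR's exact code; the field count and hidden sizes are hypothetical) of that deep branch: flatten the shared field embeddings and pass them through an MLP:

``` python
from tensorflow.keras import layers

num_fields, embed_dim = 7, 4                           # hypothetical sizes
embeds = layers.Input(shape=(num_fields, embed_dim))   # the shared field embeddings

x = layers.Flatten()(embeds)
for units in (128, 128):                               # hypothetical hidden layer sizes
    x = layers.Dense(units, activation="relu")(x)
deep_logit = layers.Dense(1, use_bias=False)(x)        # deep component's contribution to the logit
```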
## Testing with MovieLens
Download the `ml-100k` dataset:
```bash
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
```
Install the required packages, scikit-learn and deepctr (e.g. `pip install scikit-learn deepctr`), then import them:
``` python
import pandas as pd
import sklearn
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
from tqdm import tqdm
from deepctr.models import DeepFM
from deepctr.feature_column import SparseFeat, VarLenSparseFeat, get_feature_names
import numpy as np
```
Read the ratings data (tab-separated: user id, movie id, rating 1-5, timestamp):
``` python
u_data = pd.read_csv("ml-100k/u.data", sep='\t', header=None)
u_data.columns = ['user_id', 'movie_id', 'rating', 'timestamp']
```
Rated movies are labeled 1; unrated movies are randomly sampled as negatives:
``` python
def neg_sample(u_data, neg_rate=1):
    # sample negatives globally, from all movie ids
    item_ids = u_data['movie_id'].unique()
    print('start neg sample')
    neg_data = []
    for user_id, hist in tqdm(u_data.groupby('user_id')):
        # movies the current user has rated
        rated_movie_list = hist['movie_id'].tolist()
        candidate_set = list(set(item_ids) - set(rated_movie_list))
        neg_list_id = np.random.choice(candidate_set, size=len(rated_movie_list) * neg_rate, replace=True)
        for neg_id in neg_list_id:
            # rating -1 marks a sampled negative; the timestamp is unused
            neg_data.append([user_id, neg_id, -1, 0])
    u_data_neg = pd.DataFrame(neg_data)
    u_data_neg.columns = ['user_id', 'movie_id', 'rating', 'timestamp']
    u_data = pd.concat([u_data, u_data_neg])
    print('end neg sample')
    return u_data
```
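The post never shows the sampler being invoked, but the -1 labels used in the binarization step below only exist if it runs on the ratings before the joins; presumably something like:

``` python
u_data = neg_sample(u_data)  # assumed call, not shown in the original post
```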
Read the item data:
``` python
u_item = pd.read_csv("ml-100k/u.item", sep='|', header=None, encoding='latin-1', error_bad_lines=False)  # u.item contains latin-1 characters
genres_columns = ['Action', 'Adventure',
'Animation',
'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
'Film_Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi',
'Thriller', 'War', 'Western']
u_item.columns = ['movie_id', 'title', 'release_date', 'video_date', 'url', 'unknown'] + genres_columns
```
Build a combined genres column, then drop the individual genre indicator columns:
``` python
genres_list = []
for index, row in u_item.iterrows():
genres = []
for item in genres_columns:
if row[item]:
genres.append(item)
genres_list.append('|'.join(genres))
u_item['genres'] = genres_list
for item in genres_columns:
del u_item[item]
```
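After this loop each movie carries a pipe-joined genre string; for example, movie 1 in ml-100k should end up roughly as follows (output formatting approximate):

``` python
print(u_item.loc[u_item['movie_id'] == 1, ['title', 'genres']].to_string(index=False))
#             title                     genres
#  Toy Story (1995)  Animation|Children|Comedy
```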
Read the user data:
``` python
# columns: user id | age | gender | occupation | zip code (region)
u_user = pd.read_csv("ml-100k/u.user", sep='|', header=None)
u_user.columns = ['user_id', 'age', 'gender', 'occupation', 'zip']
```
Join everything together:
``` python
data = pd.merge(u_data, u_item, on="movie_id", how='left')
data = pd.merge(data, u_user, on="user_id", how='left')
data.to_csv('ml-100k/data.csv', index=False)
```
Process the features:
``` python
sparse_features = ["movie_id", "user_id",
                   "gender", "age", "occupation", "zip"]
data[sparse_features] = data[sparse_features].astype(str)

target = ['rating']
# binarize the label: real ratings (>= 0) become 1, sampled negatives (-1) become 0
data['rating'] = [1 if int(x) >= 0 else 0 for x in data['rating']]
```
Label-encode the sparse features first:
``` python
for feat in sparse_features:
    lbe = LabelEncoder()
    data[feat] = lbe.fit_transform(data[feat])
```
Process the genres feature. A movie has several genres, so first split the string, then encode each genre as an integer starting from 1 (0 is reserved for padding). Since the genre lists differ in length, compute the maximum length and zero-pad the shorter ones at the end (pad_sequences with padding='post'):
``` python
def split(x):
    key_ans = x.split('|')
    for key in key_ans:
        if key not in key2index:
            # Notice: input value 0 is a special "padding" token, so we do not
            # use 0 to encode a valid feature value for sequence input
            key2index[key] = len(key2index) + 1
    return list(map(lambda k: key2index[k], key_ans))

key2index = {}
genres_list = list(map(split, data['genres'].values))
genres_length = np.array(list(map(len, genres_list)))
max_len = max(genres_length)
# Notice: padding='post'
genres_list = pad_sequences(genres_list, maxlen=max_len, padding='post')
```
Build DeepCTR's feature columns. These come in two kinds: fixed-length SparseFeat for sparse categorical features, and variable-length VarLenSparseFeat for multi-value features such as genres:
``` python
fixlen_feature_columns = [SparseFeat(feat, data[feat].nunique(), embedding_dim=4)
for feat in sparse_features]
use_weighted_sequence = False
if use_weighted_sequence:
varlen_feature_columns = [VarLenSparseFeat(SparseFeat('genres', vocabulary_size=len(
key2index) + 1, embedding_dim=4), maxlen=max_len, combiner='mean',
weight_name='genres_weight')] # Notice : value 0 is for padding for sequence input feature
else:
varlen_feature_columns = [VarLenSparseFeat(SparseFeat('genres', vocabulary_size=len(
key2index) + 1, embedding_dim=4), maxlen=max_len, combiner='mean',
weight_name=None)] # Notice : value 0 is for padding for sequence input feature
linear_feature_columns = fixlen_feature_columns + varlen_feature_columns
dnn_feature_columns = fixlen_feature_columns + varlen_feature_columns
feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)
```
Assemble the training input: shuffle the data first, then build the dict of model inputs. Note that genres_list must be shuffled together with the DataFrame, otherwise the padded genres no longer line up with the other columns (sklearn.utils.shuffle shuffles multiple arrays consistently):
``` python
data, genres_list = sklearn.utils.shuffle(data, genres_list)

train_model_input = {name: data[name] for name in sparse_features}
train_model_input["genres"] = genres_list
```
Build the DeepFM model. Since the target is 0/1, use task='binary' with binary_crossentropy as the loss:
``` python
model = DeepFM(linear_feature_columns, dnn_feature_columns, task='binary')
model.compile(optimizer=tf.keras.optimizers.Adam(), loss='binary_crossentropy',
              metrics=['AUC', 'Precision', 'Recall'])
model.summary()
```
Train the model:
``` python
model.fit(train_model_input, data[target].values,
          batch_size=256, epochs=20, verbose=2,
          validation_split=0.2)
```
Training log:
``` bash
Epoch 1/20
625/625 - 3s - loss: 0.5081 - auc: 0.8279 - precision: 0.7419 - recall: 0.7695 - val_loss: 0.4745 - val_auc: 0.8513 - val_precision: 0.7563 - val_recall: 0.7936
Epoch 2/20
625/625 - 2s - loss: 0.4695 - auc: 0.8538 - precision: 0.7494 - recall: 0.8105 - val_loss: 0.4708 - val_auc: 0.8539 - val_precision: 0.7498 - val_recall: 0.8127
Epoch 3/20
625/625 - 2s - loss: 0.4652 - auc: 0.8564 - precision: 0.7513 - recall: 0.8139 - val_loss: 0.4704 - val_auc: 0.8545 - val_precision: 0.7561 - val_recall: 0.8017
Epoch 4/20
625/625 - 2s - loss: 0.4624 - auc: 0.8579 - precision: 0.7516 - recall: 0.8146 - val_loss: 0.4724 - val_auc: 0.8542 - val_precision: 0.7296 - val_recall: 0.8526
Epoch 5/20
625/625 - 2s - loss: 0.4607 - auc: 0.8590 - precision: 0.7521 - recall: 0.8173 - val_loss: 0.4699 - val_auc: 0.8550 - val_precision: 0.7511 - val_recall: 0.8141
Epoch 6/20
625/625 - 2s - loss: 0.4588 - auc: 0.8602 - precision: 0.7545 - recall: 0.8165 - val_loss: 0.4717 - val_auc: 0.8542 - val_precision: 0.7421 - val_recall: 0.8265
Epoch 7/20
625/625 - 2s - loss: 0.4574 - auc: 0.8610 - precision: 0.7535 - recall: 0.8192 - val_loss: 0.4722 - val_auc: 0.8547 - val_precision: 0.7549 - val_recall: 0.8023
Epoch 8/20
625/625 - 2s - loss: 0.4561 - auc: 0.8619 - precision: 0.7543 - recall: 0.8201 - val_loss: 0.4717 - val_auc: 0.8548 - val_precision: 0.7480 - val_recall: 0.8185
Epoch 9/20
625/625 - 2s - loss: 0.4531 - auc: 0.8643 - precision: 0.7573 - recall: 0.8210 - val_loss: 0.4696 - val_auc: 0.8583 - val_precision: 0.7598 - val_recall: 0.8103
Epoch 10/20
625/625 - 2s - loss: 0.4355 - auc: 0.8768 - precision: 0.7787 - recall: 0.8166 - val_loss: 0.4435 - val_auc: 0.8769 - val_precision: 0.7756 - val_recall: 0.8293
Epoch 11/20
625/625 - 2s - loss: 0.4093 - auc: 0.8923 - precision: 0.7915 - recall: 0.8373 - val_loss: 0.4301 - val_auc: 0.8840 - val_precision: 0.7806 - val_recall: 0.8390
Epoch 12/20
625/625 - 2s - loss: 0.3970 - auc: 0.8988 - precision: 0.7953 - recall: 0.8497 - val_loss: 0.4286 - val_auc: 0.8867 - val_precision: 0.7903 - val_recall: 0.8299
Epoch 13/20
625/625 - 2s - loss: 0.3896 - auc: 0.9029 - precision: 0.8001 - recall: 0.8542 - val_loss: 0.4253 - val_auc: 0.8888 - val_precision: 0.7913 - val_recall: 0.8322
Epoch 14/20
625/625 - 2s - loss: 0.3825 - auc: 0.9067 - precision: 0.8038 - recall: 0.8584 - val_loss: 0.4205 - val_auc: 0.8917 - val_precision: 0.7885 - val_recall: 0.8506
Epoch 15/20
625/625 - 2s - loss: 0.3755 - auc: 0.9102 - precision: 0.8074 - recall: 0.8624 - val_loss: 0.4204 - val_auc: 0.8940 - val_precision: 0.7868 - val_recall: 0.8607
Epoch 16/20
625/625 - 2s - loss: 0.3687 - auc: 0.9136 - precision: 0.8117 - recall: 0.8653 - val_loss: 0.4176 - val_auc: 0.8956 - val_precision: 0.8097 - val_recall: 0.8236
Epoch 17/20
625/625 - 2s - loss: 0.3617 - auc: 0.9170 - precision: 0.8155 - recall: 0.8682 - val_loss: 0.4166 - val_auc: 0.8966 - val_precision: 0.8056 - val_recall: 0.8354
Epoch 18/20
625/625 - 2s - loss: 0.3553 - auc: 0.9201 - precision: 0.8188 - recall: 0.8716 - val_loss: 0.4168 - val_auc: 0.8977 - val_precision: 0.7996 - val_recall: 0.8492
Epoch 19/20
625/625 - 2s - loss: 0.3497 - auc: 0.9227 - precision: 0.8214 - recall: 0.8741 - val_loss: 0.4187 - val_auc: 0.8973 - val_precision: 0.8079 - val_recall: 0.8358
Epoch 20/20
625/625 - 2s - loss: 0.3451 - auc: 0.9248 - precision: 0.8244 - recall: 0.8753 - val_loss: 0.4210 - val_auc: 0.8982 - val_precision: 0.7945 - val_recall: 0.8617
```
Finally, run a few predictions (here on the training input) and compare them with the labels:
``` python
pred_ans = model.predict(train_model_input, batch_size=256)
count = 0
for (i, j) in zip(pred_ans, data['rating'].values):
print(i, j)
count += 1
if count > 10:
break
```
The output:
``` bash
[0.20468083] 0
[0.1988303] 0
[7.7236204e-05] 0
[0.9439401] 1
[0.76648283] 0
[0.80082995] 1
[0.7689271] 0
[0.8515004] 1
[0.93311656] 1
[0.40019292] 0
[0.60735244] 0
```
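The log_loss, roc_auc_score, and train_test_split imports at the top are never used in the post. A sketch (my addition) of how a held-out evaluation could look, splitting row indices so the dict inputs and the genres matrix stay aligned; note the model above was fit on all rows, so this only illustrates the mechanics:

``` python
idx_train, idx_test = train_test_split(np.arange(len(data)), test_size=0.2, random_state=2020)

test_input = {name: data[name].values[idx_test] for name in sparse_features}
test_input["genres"] = genres_list[idx_test]
y_test = data['rating'].values[idx_test]

pred = model.predict(test_input, batch_size=256).ravel()
print("test LogLoss", round(log_loss(y_test, pred), 4))
print("test AUC", round(roc_auc_score(y_test, pred), 4))
```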
## References
- [deepFM in pytorch](https://blog.csdn.net/w55100/article/details/90295932)
- [皮果提《Factorization Machines 學習筆記(二)模型方程》](https://blog.csdn.net/itplus/article/details/40534923)
- [·清塵·《FM、FMM、DeepFM整理(pytorch)》](https://blog.csdn.net/u012969412/article/details/88684723)
- [用Keras實現一個DeepFM](https://blog.csdn.net/songbinxu/article/details/80151814)
---
>Author: Jadepeng
Source: jqpeng's tech notebook - [http://www.cnblogs.com/xiaoqi](http://www.cnblogs.com/xiaoqi)
Your support is the greatest encouragement to the author; thank you for reading.
The copyright of this article belongs to the author. Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a prominent link to the original given on the page; otherwise the author reserves the right to pursue legal liability.