A Detailed Guide to ANN Nearest-Neighbor Search in Python with the Faiss Library

阿新 • Published: 2020-08-04

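Nearest-neighbor search over embeddings is one of the most important recall (candidate generation) methods in today's recommender systems. Approaches such as item2vec, matrix factorization, and two-tower DNNs all produce trained user embeddings and item embeddings, and these embeddings can be used in several ways:

Given a user embedding, search the item embeddings for nearest neighbors to recommend items the user may like
Given a user embedding, search the user embeddings for nearest neighbors to recommend users the user may be interested in
Given an item embedding, search the item embeddings for nearest neighbors to recommend related items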

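There is an engineering problem, however: once the number of user and item embeddings grows large enough, nearest-neighbor search over them becomes very slow. Precomputing the results offline and storing them in a cache such as Redis works, but the results are then far from real time. Ideally the search runs online and returns within a few tens of milliseconds.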
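
To see why this becomes slow, here is a minimal sketch of exact brute-force search in NumPy. It assumes random data in place of real embeddings, and the names `brute_force_knn`, `items`, and `query` are illustrative rather than taken from the notebook. Every query is compared against every stored vector, so the cost grows linearly with the number of embeddings.

```python
import numpy as np

def brute_force_knn(query, items, k):
    """Return (distances, positions) of the k rows of `items` nearest to `query`.

    Uses squared L2 distance, computed against every row (O(n * d) per query).
    """
    diffs = items - query                          # broadcast: (n, d) - (d,)
    dists = np.einsum("ij,ij->i", diffs, diffs)    # squared L2 distance per row
    nearest = np.argsort(dists)[:k]                # positions sorted by distance
    return dists[nearest], nearest

# Hypothetical data: 1,000 random 10-dimensional item embeddings
rng = np.random.default_rng(0)
items = rng.random((1000, 10), dtype=np.float32)
query = items[42]                                  # query with a stored vector

dists, positions = brute_force_knn(query, items, k=5)
print(positions[0], dists[0])                      # the stored vector itself, at distance 0
```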

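Faiss is a library for similarity search and clustering open-sourced by the Facebook AI team. It provides efficient similarity search and clustering for dense vectors, supports search over billions of vectors, and is one of the most mature approximate nearest-neighbor search libraries available.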

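The following Jupyter notebook code walks through a simple Faiss workflow:

Read the trained embedding data
Build a Faiss index and add the embeddings to be searched
Take a target embedding and search the index to get a list of IDs
Look up the movie titles by ID and return the results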

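How to run fast nearest-neighbor search over already-trained embeddings is an engineering problem. Facebook's Faiss library can build several kinds of embedding indexes for fast nearest-neighbor search of a target embedding, fast enough for online use.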

Install command:

conda install -c pytorch faiss-cpu

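A summary of Faiss usage tips up front:

1. To use your own IDs, wrap faiss.IndexFlatL2 in faiss.IndexIDMap

2. All embedding data must be converted to np.float32, both the embeddings in the index and the embeddings used as queries

3. The ids must be converted to int64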
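
The dtype requirements in points 2 and 3 can be wrapped in a small helper. This is only a sketch: `to_faiss_inputs` is a hypothetical name, not part of Faiss, and it uses nothing but NumPy to coerce the arrays into the layout described above.

```python
import numpy as np

def to_faiss_inputs(vectors, ids):
    """Coerce embeddings to a C-contiguous float32 matrix and ids to int64."""
    vectors = np.ascontiguousarray(vectors, dtype=np.float32)
    if vectors.ndim == 1:                  # a single vector becomes shape (1, d)
        vectors = vectors.reshape(1, -1)
    ids = np.asarray(ids, dtype=np.int64)
    if ids.shape[0] != vectors.shape[0]:
        raise ValueError("need exactly one id per vector")
    return vectors, ids

# Plain Python lists of floats and ints, as json.loads would produce
vecs, vec_ids = to_faiss_inputs([[0.1, 0.2], [0.3, 0.4]], [10, 20])
print(vecs.dtype, vecs.shape, vec_ids.dtype)   # float32 (2, 2) int64
```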

1. Prepare the data

import pandas as pd
import numpy as np

df = pd.read_csv("./datas/movielens_sparkals_item_embedding.csv")
df.head()

	id	features
0	10	[0.25866490602493286,0.3560594320297241,0.15…
1	20	[0.12449632585048676,-0.29282501339912415,-0…
2	30	[0.9557555317878723,0.6764761805534363,0.114…
3	40	[0.3184879720211029,0.6365472078323364,0.596…
4	50	[0.45523127913475037,0.34402626752853394,-0….

Build ids

ids = df["id"].values.astype(np.int64)
type(ids),ids.shape
(numpy.ndarray,(3706,))
ids.dtype
dtype('int64')
ids_size = ids.shape[0]
ids_size
3706

Build datas

import json
import numpy as np
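# Each features cell is a JSON array stored as text; parse it into a list of floats
datas = []
for x in df["features"]:
    datas.append(json.loads(x))
# Faiss expects float32 vectors
datas = np.array(datas).astype(np.float32)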
datas.dtype
dtype('float32')
datas.shape
(3706,10)
datas[0]
array([ 0.2586649,0.35605943,0.15589039,-0.7067125,-0.07414215,-0.62500805,-0.0573845,0.4533663,0.26074877,-0.60799956],dtype=float32)
# Embedding dimension
dimension = datas.shape[1]
dimension
10
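
The parsing step can also be written more compactly. The sketch below assumes the same layout as the file above (an `id` column and a `features` column holding a JSON array as text) and uses a tiny in-memory CSV with made-up values so that it runs without the data file.

```python
import io
import json

import numpy as np
import pandas as pd

# Made-up stand-in for the embedding CSV; the features field is quoted
# because the JSON array itself contains commas
csv_text = '''id,features
10,"[0.25, 0.5, -0.75]"
20,"[1.0, 0.0, 0.125]"
'''
df_demo = pd.read_csv(io.StringIO(csv_text))

# Parse every JSON string and stack the rows into one float32 matrix
demo_datas = np.array([json.loads(x) for x in df_demo["features"]],
                      dtype=np.float32)
demo_ids = df_demo["id"].to_numpy(dtype=np.int64)

print(demo_datas.shape, demo_ids)     # (2, 3) [10 20]
```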

2. Build the index

import faiss
index = faiss.IndexFlatL2(dimension) # exact (brute-force) L2 index
index2 = faiss.IndexIDMap(index) # wrapper that maps the vectors to our own ids
ids.dtype
dtype('int64')
index2.add_with_ids(datas,ids)
index2.ntotal
3706

3. Search for the list of nearest-neighbor IDs

df_user = pd.read_csv("./datas/movielens_sparkals_user_embedding.csv")
df_user.head()
	id	features
0	10	[0.5974288582801819,0.17486965656280518,0.04…
1	20	[1.3099910020828247,0.5037978291511536,0.260…
2	30	[-1.1886241436004639,-0.13511677086353302,0….
3	40	[1.0809299945831299,1.0048035383224487,0.986…
4	50	[0.42388680577278137,0.5294889807701111,-0.6…

user_embedding = np.array(json.loads(df_user[df_user["id"] == 10]["features"].iloc[0]))
user_embedding = np.expand_dims(user_embedding,axis=0).astype(np.float32)
user_embedding
array([[ 0.59742886,0.17486966,0.04345559,-1.3193961,0.5313592,-0.6052168,-0.19088413,1.5307966,0.09310367,-2.7573566 ]],dtype=float32)
user_embedding.shape
(1,10)
user_embedding.dtype
dtype('float32')
topk = 30
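# Search the ID-mapped index (index2) so that I holds the ids passed to add_with_ids;
# searching the inner index would return row positions instead
D,I = index2.search(user_embedding,topk) # actual search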
I.shape
(1,30)
I
array([[3380,2900,1953,121,3285,999,617,747,2351,601,2347,42,2383,538,1774,980,2165,3049,2664,367,3289,2866,2452,547,1072,2055,3660,3343,3390,3590]])
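
`search` returns two arrays of shape `(n_queries, topk)`: `D` holds the distances (squared L2 for `IndexFlatL2`) and `I` holds the matching ids, with `-1` filling any slot for which no neighbor was found. A small sketch of pairing them up, using hand-written arrays in place of real search output (`pair_results` is a hypothetical helper):

```python
import numpy as np

def pair_results(D, I):
    """Turn the first query's search output into a list of (id, distance) pairs,
    skipping the -1 placeholders used for missing neighbors."""
    return [(int(i), float(dist)) for i, dist in zip(I[0], D[0]) if i != -1]

# Hand-written stand-ins for the arrays returned by search();
# the last slot is an unfilled placeholder
D_demo = np.array([[0.5, 1.25, 3.0]], dtype=np.float32)
I_demo = np.array([[3380, 2900, -1]], dtype=np.int64)

print(pair_results(D_demo, I_demo))   # [(3380, 0.5), (2900, 1.25)]
```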

4. Look up movie information by movie ID

target_ids = pd.Series(I[0],name="MovieID")
target_ids.head()
0 3380
1 2900
2 1953
3 121
4 3285
Name: MovieID,dtype: int64
df_movie = pd.read_csv("./datas/ml-1m/movies.dat",sep="::",header=None,engine="python",names = "MovieID::Title::Genres".split("::"))
df_movie.head()

	MovieID	Title	Genres
0	1	Toy Story (1995)	Animation|Children's|Comedy
1	2	Jumanji (1995)	Adventure|Children's|Fantasy
2	3	Grumpier Old Men (1995)	Comedy|Romance
3	4	Waiting to Exhale (1995)	Comedy|Drama
4	5	Father of the Bride Part II (1995)	Comedy

df_result = pd.merge(target_ids,df_movie)
df_result.head()

	MovieID	Title	Genres
0	3380	Railroaded! (1947)	Film-Noir
1	2900	Monkey Shines (1988)	Horror|Sci-Fi
2	1953	French Connection, The (1971)	Action|Crime|Drama|Thriller
3	121	Boys of St. Vincent, The (1993)	Drama
4	3285	Beach, The (2000)	Adventure|Drama
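
If the distances are also of interest, they can be carried through the merge. The sketch below uses a few made-up rows in place of `movies.dat` and of the real search output; a left merge keeps the neighbors in ranked order and leaves `NaN` titles for ids missing from the movie table.

```python
import pandas as pd

# Made-up stand-ins for the search output and the movie table
neighbors = pd.DataFrame({
    "MovieID": [30, 10, 99],          # ranked nearest first; 99 has no movie row
    "distance": [0.1, 0.4, 0.9],
})
movies = pd.DataFrame({
    "MovieID": [10, 20, 30],
    "Title": ["Movie A", "Movie B", "Movie C"],
})

# how="left" keeps every neighbor, in the order of the left frame
ranked = neighbors.merge(movies, on="MovieID", how="left")
print(ranked["Title"].tolist())       # ['Movie C', 'Movie A', nan]
```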

