
First experience with sentence transformers for review analysis on macOS

References:

Chinese translation: https://mp.weixin.qq.com/s/gs0MI9xSM-NCPnMGjDGlWA

English original: https://int8.io/are-you-ok-cyberpunk-transformers-diagnosis/#Some_technical_details_8211_data_source_preprocessing_embedding

Part of the original author's code: https://gist.github.com/int8/6684f968b252314cc8b5b87296ea2367

Goal: use sentence transformers to embed the reviews as vectors, then reduce dimensionality, cluster, segment the text into words, count word frequencies, and generate word clouds.

Cleaning the review data: strip emoji, and convert Traditional Chinese to Simplified Chinese with snownlp (a 37 MB package).

# Traditional -> Simplified conversion; the library is fairly old and handles everything as Unicode
from snownlp import SnowNLP

s = SnowNLP(sentence)          # `sentence` is a single cleaned review
all_sentences.append(s.han)    # convert Traditional Chinese to Simplified
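The emoji-removal step is not shown in the post; a minimal sketch, assuming a regex over a few common emoji code-point ranges (the name remove_emoji is hypothetical), could look like this:

import re

# Common emoji blocks (not exhaustive): symbols & pictographs, dingbats, regional-indicator flags
EMOJI_PATTERN = re.compile(
    '[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF]+')

def remove_emoji(text):
    return EMOJI_PATTERN.sub('', text)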

Roadblocks while setting up the environment:

PyTorch is 100+ MB, pip installs it slowly, and an interrupted download has to start over from scratch. Reference: https://www.pythonf.cn/read/132147

Instead, download the matching wheel with a browser from the official index: https://download.pytorch.org/whl/torch_stable.html

After downloading it locally, and because the project uses a virtual env, install the whl file from PyCharm's Python Console; the 15 MB numpy dependency is just about bearable.

>>> import pip
>>> from pip._internal.cli.main import main as pipmain
>>> pipmain(['install', '/Users/xxx/Downloads/torch-1.7.1-cp38-none-macosx_10_9_x86_64.whl'])
Processing /Users/xxx/Downloads/torch-1.7.1-cp38-none-macosx_10_9_x86_64.whl
Collecting numpy
  Downloading numpy-1.19.4-cp38-cp38-macosx_10_9_x86_64.whl (15.3 MB)
Collecting typing-extensions
  Downloading typing_extensions-3.7.4.3-py3-none-any.whl (22 kB)
Installing collected packages: typing-extensions, numpy, torch
Successfully installed numpy-1.19.4 torch-1.7.1 typing-extensions-3.7.4.3
0
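Note that pip._internal is not a public API and can change between pip releases; a more robust sketch of the same idea from the PyCharm console would shell out to pip with the current interpreter:

import subprocess, sys

# Install the downloaded wheel using the interpreter's own pip
subprocess.check_call([sys.executable, '-m', 'pip', 'install',
                       '/Users/xxx/Downloads/torch-1.7.1-cp38-none-macosx_10_9_x86_64.whl'])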

These downloads also take a while: scipy-1.5.4-cp38-cp38-macosx_10_9_x86_64.whl (29.0 MB) and scikit_learn-0.24.0-cp38-cp38-macosx_10_9_x86_64.whl (7.2 MB).

Finally, run:

pip3 install sentence_transformers

...

Successfully installed certifi-2020.12.5 chardet-4.0.0 click-7.1.2 filelock-3.0.12 idna-2.10 joblib-1.0.0 nltk-3.5 packaging-20.8 pyparsing-2.4.7 regex-2020.11.13 requests-2.25.1 sacremoses-0.0.43 scikit-learn-0.24.0 scipy-1.5.4 sentence-transformers-0.4.0 sentencepiece-0.1.94 six-1.15.0 threadpoolctl-2.1.0 tokenizers-0.9.4 tqdm-4.54.1 transformers-4.1.1 urllib3-1.26.2

Model download

The model is downloaded automatically the first time execution reaches this point:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')
embeddings = model.encode(all_sentences)  # all_sentences: ['review 1', 'review 2', ...]

1%| | 10.2M/1.31G [01:36<3:25:46, 106kB/s]

The download was much faster after routing it through a proxy.
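Once the download finishes, a quick sanity check (a sketch) confirms one 1024-dimensional vector per review, which matches the UMAP comment below:

print(type(embeddings), embeddings.shape)
# e.g. <class 'numpy.ndarray'> (number_of_reviews, 1024)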

# Use UMAP to reduce the 1024-dimensional sentence_transformers embeddings to 2 dimensions

# pip3 install umap-learn   (do not use `pip3 install umap`)

import umap
umap_embeddings = umap.UMAP(n_neighbors=15,
                            min_dist=0.0,
                            n_components=2,
                            metric='cosine').fit_transform(embeddings)

UMAP official documentation: https://umap-learn.readthedocs.io/en/latest/

Parameter overview: https://umap-learn.readthedocs.io/en/latest/parameters.html

# n_neighbors
# Default 15. At 2 the plot is a set of disconnected scattered points (isolated points are sometimes even dropped);
# at 200 it reflects the overall global structure.
# This effect well exemplifies the local/global tradeoff provided by n_neighbors.

# min_dist
# How tightly points can be packed together.
# Default 0.1, range 0.0 to 0.99. Use small values when you want clustering or fine topological detail;
# larger values keep points apart and give the broad topological structure instead.

# n_components: the target dimensionality after the reduction

# metric
# This controls how distance is computed in the ambient space of the input data.
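A small sketch of that local/global tradeoff, assuming matplotlib is installed and `embeddings` is the array produced earlier:

import matplotlib.pyplot as plt
import umap

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, n in zip(axes, [2, 15, 200]):
    emb_2d = umap.UMAP(n_neighbors=n, min_dist=0.0,
                       n_components=2, metric='cosine').fit_transform(embeddings)
    ax.scatter(emb_2d[:, 0], emb_2d[:, 1], s=2)   # scatter plot of the 2-D embedding
    ax.set_title(f'n_neighbors={n}')
plt.show()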

# Cluster analysis

from sklearn.cluster import KMeans

num_clusters = 10
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(umap_embeddings)
cluster_assignment = clustering_model.labels_  # one cluster label per review
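To use the clusters downstream, the reviews can be grouped by their label before word segmentation; a minimal sketch, assuming `all_sentences` is the cleaned review list from the snownlp step:

from collections import defaultdict

clustered_sentences = defaultdict(list)
for sentence, label in zip(all_sentences, cluster_assignment):
    clustered_sentences[label].append(sentence)

for label in sorted(clustered_sentences):
    print(label, len(clustered_sentences[label]))   # quick look at cluster sizes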

Chinese word segmentation tools

# from minlptokenizer.tokenizer import MiNLPTokenizer  # requires tensorflow; its installation kept failing, so it was dropped

The network speed left me speechless: typically just 4.9 kB/s.

Switched to the Chinese word segmentation toolkit open-sourced by Peking University:
import pkuseg  # https://github.com/lancopku/PKUSeg-python

seg = pkuseg.pkuseg()            # load the model with the default configuration
text = seg.cut('我愛北京天安門')  # segment the sentence into words
print(text)                      # ['我', '愛', '北京', '天安門']
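With the segmenter in place, the word-frequency step of the goal can be a simple Counter per cluster; a sketch, assuming `clustered_sentences` from the clustering sketch above:

from collections import Counter
import pkuseg

seg = pkuseg.pkuseg()
word_counts = Counter()
for sentence in clustered_sentences[0]:   # e.g. cluster 0
    word_counts.update(w for w in seg.cut(sentence) if len(w) > 1)   # skip single characters
print(word_counts.most_common(20))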

Word-cloud configuration options: https://blog.csdn.net/jinsefm/article/details/80645588
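For the final word-cloud step, a minimal sketch using the wordcloud package (pip3 install wordcloud); a CJK-capable font must be passed via font_path, and the macOS font path below is an assumption that may need adjusting:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(font_path='/System/Library/Fonts/STHeiti Medium.ttc',   # assumed macOS CJK font
               background_color='white', width=800, height=600)
wc.generate_from_frequencies(dict(word_counts))   # word_counts from the Counter sketch above
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()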