
One Way to Turn Short Texts into Vectors


Preface

The implementation below is fairly rough; there is a lot of room for improvement, really a lot! And fair warning: this post does not explain the underlying theory at all, unreasonable as that may be…

Approach

  1. Tokenization. jieba is still the best choice for Chinese word segmentation; gensim is used for word2vec training.
  2. Train a base word2vec model on a large general-purpose corpus.
  3. Fine-tune the model on a domain-specific (business-oriented) corpus so that it learns specialized vocabulary.
  4. After tokenizing a text, use AVG-W2V to obtain the short-text vector, i.e., average all of its word vectors; the dimensionality equals the word2vec vector size (see the sketch after this list).
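The averaging in step 4 boils down to a few lines. Here is a minimal sketch of the AVG-W2V idea, assuming a trained gensim word2vec model with the pre-4.0 model[word] / word in model interface used throughout this post:

import numpy as np

def avg_w2v(model, words):
    # Keep only words that made it into the vocabulary
    vectors = [model[w] for w in words if w in model]
    if not vectors:
        return None
    # The sentence vector is the element-wise mean of the word vectors
    return np.mean(vectors, axis=0)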

word2vec configuration

  • w2v.properties
# Some practical tips
# Architecture (sg): skip-gram (slower, better for rare words) vs CBOW (faster)
# Training algorithm (hs): hierarchical softmax (better for rare words) vs negative sampling (better for frequent words and low-dimensional vectors)
# Subsampling frequent words (sample): can improve both accuracy and speed (useful range: 1e-3 to 1e-5)
# Window size (window): usually around 10 for skip-gram, around 5 for CBOW
# For large corpora, consider raising min_count and lowering iter

# Training architecture: 0 for CBOW, 1 for skip-gram; default is 0
sg=1
# Dimensionality of the feature vectors
size=300
# Context window size
window=5
# Minimum word frequency
min_count=5
# Initial learning rate
alpha=0.025
# 0 for negative sampling, 1 for hierarchical softmax; default is 0
hs=1
# Number of training iterations (epochs)
iter=10
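The properties above are read through a small PropertiesUtil helper, imported as prop in the code below. Its implementation is not included in this post, so the following is only a hypothetical sketch of what it needs to provide: get_config_dict and get_config_value over plain key=value lines.

# -*- coding:utf-8 -*-
# Hypothetical sketch of util/PropertiesUtil.py (not shown in the original post)


class PropertiesUtil(object):
    @staticmethod
    def get_config_dict(path):
        # Parse a .properties file into a {key: value} dict, skipping blanks and comments
        config = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                key, _, value = line.partition("=")
                config[key.strip()] = value.strip()
        return config

    @staticmethod
    def get_config_value(path, key):
        # Return a single value from a .properties file
        return PropertiesUtil.get_config_dict(path)[key]


prop = PropertiesUtil()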

Code

  • Code for base training on a large corpus
# -*- coding:utf-8 -*-
"""
Description: word2vec model trained on a large Baidu Baike corpus

@author: WangLeAi
@date: 2018/9/18
"""
import os
from util.DBUtil import DbPoolUtil
from util.JiebaUtil import jieba_util
from util.PropertiesUtil import prop
from gensim.models import word2vec


class OriginModel(object):
    def __init__(self):
        self.params = prop.get_config_dict("config/w2v.properties")
        self.db_pool_util = DbPoolUtil(db_type="mysql")
        self.train_data_path = "gen/ori_train_data.txt"
        self.model_path = "model/oriw2v.model"

    @staticmethod
    def text_process(sentence):
        """
        Text preprocessing
        :param sentence:
        :return:
        """
        # Filter out anything that is not Chinese, English or digits
        # regex = re.compile(u'[^\u4e00-\u9fa50-9a-zA-Z\-·]+')
        # sentence = regex.sub('', sentence)
        words = jieba_util.jieba_cut(sentence)
        return words

    def get_train_data(self):
        """
        Fetch the training data. Adapt this to your own data source, and
        prefer writing to a file over pulling everything into memory!
        :return:
        """
        print("Building training data for the base corpus")
        sql = """ """
        sentences = self.db_pool_util.loop_row(origin_model, "text_process", sql)
        with open(self.train_data_path, "w", encoding="utf-8") as f:
            for sentence in sentences:
                f.write(" ".join(sentence) + "\n")

    def train_model(self):
        """
        Train the model
        :return:
        """
        if not os.path.exists(self.train_data_path):
            self.get_train_data()
        print("Training the base model")
        sentences = word2vec.LineSentence(self.train_data_path)
        model = word2vec.Word2Vec(sentences=sentences, sg=int(self.params["sg"]),
                                  size=int(self.params["size"]), window=int(self.params["window"]),
                                  min_count=int(self.params["min_count"]),
                                  alpha=float(self.params["alpha"]), hs=int(self.params["hs"]),
                                  workers=6, iter=int(self.params["iter"]))
        model.save(self.model_path)
        print("Base model trained and saved")


origin_model = OriginModel()
  • Fine-tuning on an extra corpus
# -*- coding:utf-8 -*-
"""
Description: word2vec fine-tuning
Fine-tune on an extra corpus of the relevant domain

@author: WangLeAi
@date: 2018/9/11
"""
import os
from util.DBUtil import DbPoolUtil
from util.JiebaUtil import jieba_util
from util.PropertiesUtil import prop
from gensim.models import word2vec
from algorithms.OriginModel import origin_model


class Word2VecModel(object):
    def __init__(self):
        self.db_pool_util = DbPoolUtil(db_type="mysql")
        self.train_data_path = "gen/train_data.txt"
        self.origin_model_path = "model/oriw2v.model"
        self.model_path = "model/w2v.model"
        self.model = None
        # An out-of-vocabulary word must appear at least min_count times before it enters the vocabulary
        self.min_count = int(prop.get_config_value("config/w2v.properties", "min_count"))

    @staticmethod
    def text_process(sentence):
        """
        Text preprocessing
        :param sentence:
        :return:
        """
        # Filter out anything that is not Chinese, English or digits
        # regex = re.compile(u'[^\u4e00-\u9fa50-9a-zA-Z\-·]+')
        # sentence = regex.sub('', sentence)
        words = jieba_util.jieba_cut(sentence)
        return words

    def get_train_data(self):
        """
        Fetch the training data. Adapt this to your own data source, and
        prefer writing to a file over pulling everything into memory!
        :return:
        """
        print("建立額外語料訓練資料")
        sql = """ """
        sentences = self.db_pool_util.loop_row(w2v_model, "text_process", sql)
        with open(self.train_data_path, "a", encoding="utf-8") as f:
            for sentence in sentences:
                f.write(" ".join(sentence) + "\n")

    def train_model(self):
        """
        Train the model
        :return:
        """
        if not os.path.exists(self.origin_model_path):
            print("無初始模型,進行初始模型訓練")
            origin_model.train_model()
        model = word2vec.Word2Vec.load(self.origin_model_path)
        print("初始模型載入完畢")
        if not os.path.exists(self.train_data_path):
            self.get_train_data()
        print("額外語料訓練")
        extra_sentences = word2vec.LineSentence(self.train_data_path)
        model.build_vocab(extra_sentences, update=True)
        model.train(extra_sentences, total_examples=model.corpus_count, epochs=model.iter)
        model.save(self.model_path)
        print("額外語料訓練完畢")

    def load_model(self):
        """
        Load the model
        :return:
        """
        print("載入詞嵌入模型")
        if not os.path.exists(self.model_path):
            print("無詞嵌入模型,進行訓練")
            self.train_model()
        self.model = word2vec.Word2Vec.load(self.model_path)
        print("詞嵌入模型載入完畢")

    def get_word_vector(self, words, extra=0):
        """
        Get a word vector. The model must be loaded first.
        :param words:
        :param extra: whether to handle out-of-vocabulary words, 0 = no, 1 = yes
        :return:
        """
        if extra:
            if words not in self.model:
                more_sentences = [[words, ] for i in range(self.min_count)]
                self.model.build_vocab(more_sentences, update=True)
                self.model.train(more_sentences, total_examples=self.model.corpus_count, epochs=self.model.iter)
                self.model.save(self.model_path)
        rst = None
        if words in self.model:
            rst = self.model[words]
        return rst

    def get_sentence_vector(self, sentence, extra=0):
        """
        Get a text vector. The model must be loaded first.
        :param sentence:
        :param extra: whether to handle out-of-vocabulary words, 0 = no, 1 = yes
        :return:
        """
        words = jieba_util.jieba_cut_flag(sentence)
        if not words:
            words = jieba_util.jieba_cut(sentence)
        if not words:
            print("存在無法切出有效詞的句子:" + sentence)
            # raise Exception("存在無法切出有效詞的句子:" + sentence)
        if extra:
            for item in words:
                if item not in self.model:
                    more_sentences = [words for i in range(self.min_count)]
                    self.model.build_vocab(more_sentences, update=True)
                    self.model.train(more_sentences, total_examples=self.model.corpus_count, epochs=self.model.iter)
                    self.model.save(self.model_path)
                    break
        return self.get_sentence_embedding(words)

    def get_sentence_embedding(self, words):
        """
        Get a short-text vector; recommended for short texts only.
        The text vector is the mean of all word vectors in the sentence. This is unsuitable for
        long texts because frequent words dominate the average; for long texts use gensim's doc2vec.
        :param words:
        :return:
        """
        count = 0
        vector = None
        for item in words:
            if item in self.model:
                count += 1
                if vector is not None:
                    vector = vector + self.model[item]
                else:
                    vector = self.model[item]
        if vector is not None:
            vector = vector / count
        return vector


w2v_model = Word2VecModel()
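Once the model is loaded, sentence vectors can be compared directly. The following usage sketch (not part of the original post) scores two short texts by the cosine similarity of their AVG-W2V vectors; the two example sentences are arbitrary:

import numpy as np
from algorithms.Word2VecModel import w2v_model


def cosine_similarity(v1, v2):
    # Cosine of the angle between the two sentence vectors
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))


w2v_model.load_model()
v1 = w2v_model.get_sentence_vector("今天天氣很好")
v2 = w2v_model.get_sentence_vector("今天天氣不錯")
if v1 is not None and v2 is not None:
    print(cosine_similarity(v1, v2))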

  • Testing
# -*- coding:utf-8 -*-
"""
Description:

@author: WangLeAi
@date: 2018/9/18
"""
import os
from algorithms.Word2VecModel import w2v_model


def main():
    root_path = os.path.split(os.path.realpath(__file__))[0]
    if not os.path.exists(root_path + "/model"):
        os.mkdir(root_path + "/model")
    w2v_model.load_model()
    print(w2v_model.get_sentence_vector("不知不覺間我已經忘記了愛"))


if __name__ == "__main__":
    main()
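With size=300 in w2v.properties, the printed result is a 300-dimensional numpy array. A quick sanity check (a sketch, assuming the model has been loaded as in main() above):

vec = w2v_model.get_sentence_vector("不知不覺間我已經忘記了愛")
if vec is not None:
    print(vec.shape)  # (300,) with the size=300 configuration above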

Further reading

  1. Resources on text-similarity algorithms (highly recommended!): click here
  2. For the DbPoolUtil / DBUtils details, see my earlier post (with a few small changes): click here

Full code

Download link