
Training word2vec on a Chinese Corpus with gensim

 

Contents

Training word2vec on a Chinese Corpus with gensim

1. Project directory structure

1.1 File descriptions:

1.2 Project download

2. Chinese word segmentation with jieba

2.1 Adding a custom dictionary

2.2 Adding stop words

2.3 jieba word segmentation

2.4 Complete code and how to run it

3. Training the model with gensim


1. Project directory structure

1.1 File descriptions:

data: holds the data files; the source folder contains the corpus files (there can be more than one), and the segment folder contains the corresponding word-segmented output for each source file

models: stores the model files saved after gensim training

segment.py: Python script for Chinese word segmentation

word2vec.py: Python script for gensim training
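
The layout described above roughly corresponds to the following tree (a sketch; the actual repository may contain additional helper files such as the utils package used below):

word2vec/
├── data/
│   ├── source/          # raw corpus files (one or more .txt files)
│   ├── segment/         # word-segmented output, one file per source file
│   ├── user_dict.txt    # custom dictionary for jieba
│   └── stop_words.txt   # stop-word list
├── models/              # saved gensim models, e.g. word2Vec.model
├── segment.py           # Chinese word segmentation
└── word2vec.py          # gensim training (word2vec_gensim.py in the repository header)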

1.2 Project download

Project download: GitHub https://github.com/PanJinquan/nlp-learning-tutorials/tree/master/word2vec . If you find it useful, please give it a Star.


2. Chinese word segmentation with jieba

We use the original text of the novel In the Name of the People (《人民的名義》) as the corpus; it can be downloaded here. With the corpus in hand, the first step is word segmentation, which we do with jieba.
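
Before adding a custom dictionary, it is worth checking jieba's default behavior on the novel's character names (a standalone snippet for illustration, not part of the project code; without a custom dictionary such out-of-vocabulary names may be split into single characters):

import jieba

# With only jieba's built-in dictionary, names from the novel such as 沙瑞金
# may not be recognized as single words.
print("/".join(jieba.cut("沙瑞金讚歎易學習的胸懷")))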

2.1 Adding a custom dictionary

  • Developers can supply their own custom dictionary to cover words that are not in the jieba vocabulary. Although jieba has some ability to recognize new words, adding them explicitly guarantees higher accuracy.
  • Usage: jieba.load_userdict(file_name)  # file_name is a file-like object or the path to the custom dictionary (see the snippet after the example below)
  • The dictionary format is the same as dict.txt: one word per line, with each line consisting of the word, its frequency (optional) and its part-of-speech tag (optional), separated by spaces, in that order. If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.
  • If the frequency is omitted, jieba uses an automatically computed frequency that is just high enough to guarantee the word can be segmented out.

For example:

沙瑞金 5
田國富 5
高育良 5
侯亮平 5
鍾小艾 5
陳岩石 5
歐陽菁 5
易學習 5
王大路 5
蔡成功 5
孫連城 5
季昌明 5
丁義珍 5
鄭西坡 5
趙東來 5
高小琴 5
趙瑞龍 5
林華華 5
陸亦可 5
劉新建 5
劉慶祝 5
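
Assuming the list above is saved as data/user_dict.txt (the same path used in the complete code below), loading it and re-segmenting shows the names kept intact; a minimal sketch:

import jieba

# Load the custom dictionary so the character names are treated as whole words.
jieba.load_userdict('data/user_dict.txt')
print("/".join(jieba.cut("沙瑞金和高育良是老同事")))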

2.2 Adding stop words

def getStopwords(path):
    '''
    Load the stop-word list from a text file (one stop word per line).
    :param path: path to the stop-word file
    :return: list of stop words
    '''
    stopwords = []
    with open(path, "r", encoding='utf8') as f:
        for line in f:
            stopwords.append(line.strip())
    return stopwords
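
A minimal usage sketch, assuming data/stop_words.txt contains one stop word per line (the same file used in the complete code below):

stopwords = getStopwords('data/stop_words.txt')
print(len(stopwords), stopwords[:10])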

2.3 jieba word segmentation

def segment_line(file_list, segment_out_dir, stopwords=[]):
    '''
    Word segmentation, line by line: segment each line of each input file.
    :param file_list: list of input corpus files
    :param segment_out_dir: output directory for the segmented files
    :param stopwords: list of stop words to filter out
    :return:
    '''
    for i, file in enumerate(file_list):
        segment_out_name = os.path.join(segment_out_dir, 'segment_{}.txt'.format(i))
        # note: append mode, so re-running will append to an existing output file
        segment_file = open(segment_out_name, 'a', encoding='utf8')
        with open(file, encoding='utf8') as f:
            text = f.readlines()
            for sentence in text:
                # jieba.cut(): the sentence argument must be str (unicode)
                sentence = list(jieba.cut(sentence))
                sentence_segment = []
                for word in sentence:
                    if word not in stopwords:
                        sentence_segment.append(word)
                segment_file.write(" ".join(sentence_segment))
        segment_file.close()

def segment_lines(file_list, segment_out_dir, stopwords=[]):
    '''
    Word segmentation, whole file: segment the entire content of each input file.
    :param file_list: list of input corpus files
    :param segment_out_dir: output directory for the segmented files
    :param stopwords: list of stop words to filter out
    :return:
    '''
    for i, file in enumerate(file_list):
        segment_out_name = os.path.join(segment_out_dir, 'segment_{}.txt'.format(i))
        with open(file, 'rb') as f:
            document = f.read()
            # document_decode = document.decode('GBK')
            document_cut = jieba.cut(document)
            sentence_segment = []
            for word in document_cut:
                if word not in stopwords:
                    sentence_segment.append(word)
            result = ' '.join(sentence_segment)
            result = result.encode('utf-8')
            with open(segment_out_name, 'wb') as f2:
                f2.write(result)

2.4 Complete code and how to run it

# -*-coding: utf-8 -*-
"""
    @Project: nlp-learning-tutorials
    @File   : segment.py
    @Author : panjq
    @E-mail : [email protected]
    @Date   : 2017-05-11 17:51:53
"""

##
import jieba
import os
from utils import files_processing

'''
read() reads the entire file at once and returns its content as a single string (str).
readline() reads one line at a time and returns that line as a str.
readlines() reads the whole file line by line and returns the lines as a list (list of str).
'''
def getStopwords(path):
    '''
    Load the stop-word list from a text file (one stop word per line).
    :param path: path to the stop-word file
    :return: list of stop words
    '''
    stopwords = []
    with open(path, "r", encoding='utf8') as f:
        for line in f:
            stopwords.append(line.strip())
    return stopwords

def segment_line(file_list, segment_out_dir, stopwords=[]):
    '''
    Word segmentation, line by line: segment each line of each input file.
    :param file_list: list of input corpus files
    :param segment_out_dir: output directory for the segmented files
    :param stopwords: list of stop words to filter out
    :return:
    '''
    for i, file in enumerate(file_list):
        segment_out_name = os.path.join(segment_out_dir, 'segment_{}.txt'.format(i))
        # note: append mode, so re-running will append to an existing output file
        segment_file = open(segment_out_name, 'a', encoding='utf8')
        with open(file, encoding='utf8') as f:
            text = f.readlines()
            for sentence in text:
                # jieba.cut(): the sentence argument must be str (unicode)
                sentence = list(jieba.cut(sentence))
                sentence_segment = []
                for word in sentence:
                    if word not in stopwords:
                        sentence_segment.append(word)
                segment_file.write(" ".join(sentence_segment))
        segment_file.close()

def segment_lines(file_list, segment_out_dir, stopwords=[]):
    '''
    Word segmentation, whole file: segment the entire content of each input file.
    :param file_list: list of input corpus files
    :param segment_out_dir: output directory for the segmented files
    :param stopwords: list of stop words to filter out
    :return:
    '''
    for i, file in enumerate(file_list):
        segment_out_name = os.path.join(segment_out_dir, 'segment_{}.txt'.format(i))
        with open(file, 'rb') as f:
            document = f.read()
            # document_decode = document.decode('GBK')
            document_cut = jieba.cut(document)
            sentence_segment = []
            for word in document_cut:
                if word not in stopwords:
                    sentence_segment.append(word)
            result = ' '.join(sentence_segment)
            result = result.encode('utf-8')
            with open(segment_out_name, 'wb') as f2:
                f2.write(result)


if __name__ == '__main__':

    # parallel word segmentation (jieba's parallel mode is based on
    # multiprocessing and is not supported on Windows)
    # jieba.enable_parallel()

    # load the custom dictionary
    user_path = 'data/user_dict.txt'
    jieba.load_userdict(user_path)

    # load the stop-word list
    stopwords_path = 'data/stop_words.txt'
    stopwords = getStopwords(stopwords_path)

    # segment every *.txt file under data/source and write the results to data/segment
    file_dir = 'data/source'
    segment_out_dir = 'data/segment'
    file_list = files_processing.get_files_list(file_dir, postfix='*.txt')
    segment_lines(file_list, segment_out_dir, stopwords)
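
files_processing is a small helper module from the repository's utils package. If you want to run segment.py on its own, a glob-based stand-in with the behavior the call above appears to expect (an assumption, not the repository's actual implementation) could be:

import glob
import os

def get_files_list(file_dir, postfix='*.txt'):
    '''Hypothetical stand-in: return all files under file_dir matching the pattern.'''
    return glob.glob(os.path.join(file_dir, postfix))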

3. Training the model with gensim

# -*-coding: utf-8 -*-
"""
    @Project: nlp-learning-tutorials
    @File   : word2vec_gensim.py
    @Author : panjq
    @E-mail : [email protected]
    @Date   : 2017-05-11 17:04:35
"""

from gensim.models import word2vec
import multiprocessing

def train_wordVectors(sentences, embedding_size=128, window=5, min_count=5):
    '''
    :param sentences: an iterable of tokenized sentences, e.g. a LineSentence or
                    PathLineSentences object over the segmented files, or simply a list of
                    lists of tokens such as [['我','是','中國','人'],['我','的','家鄉','在','廣東']]
    :param embedding_size: dimensionality of the word vectors
    :param window: context window size
    :param min_count: ignores all words with a total frequency lower than this
    :return: w2vModel
    '''
    # note: in gensim >= 4.0 the `size` parameter was renamed to `vector_size`
    w2vModel = word2vec.Word2Vec(sentences, size=embedding_size, window=window,
                                 min_count=min_count, workers=multiprocessing.cpu_count())
    return w2vModel

def save_wordVectors(w2vModel,word2vec_path):
    w2vModel.save(word2vec_path)

def load_wordVectors(word2vec_path):
    w2vModel = word2vec.Word2Vec.load(word2vec_path)
    return w2vModel

if __name__ == '__main__':

    # [1] If there is only one file, read it with LineSentence
    # segment_path = './data/segment/segment_0.txt'
    # sentences = word2vec.LineSentence(segment_path)

    # [2] If there are multiple files, read the whole directory with PathLineSentences
    segment_dir = './data/segment'
    sentences = word2vec.PathLineSentences(segment_dir)

    # quick training run with explicit parameters
    model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=3, size=100)
    print(model.wv.similarity('沙瑞金', '高育良'))
    # print(model.wv.similarity('李達康'.encode('utf-8'), '王大路'.encode('utf-8')))

    # typical training: only the following parameters need to be set
    word2vec_path = './models/word2Vec.model'
    model2 = train_wordVectors(sentences, embedding_size=128, window=5, min_count=5)
    save_wordVectors(model2, word2vec_path)
    model2 = load_wordVectors(word2vec_path)
    print(model2.wv.similarity('沙瑞金', '高育良'))

Output:

0.968616
0.994922
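
Beyond the pairwise similarity printed above, the trained model can be queried in other common ways; a short sketch, assuming the model saved above and that the queried words are in its vocabulary:

from gensim.models import word2vec

model = word2vec.Word2Vec.load('./models/word2Vec.model')

# Top-5 most similar words to 沙瑞金 by cosine similarity.
print(model.wv.most_similar('沙瑞金', topn=5))

# The raw 128-dimensional vector learned for a single word.
print(model.wv['沙瑞金'].shape)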