Training word2vec on a Chinese Corpus with gensim
1. Project directory structure
1.1 File descriptions
data: holds the data files; the source folder contains the corpus files (there can be more than one), and the segment folder holds the corresponding word-segmented output files
models: stores the model files produced by gensim training
segment.py: the Python script for Chinese word segmentation
word2vec.py: the Python script for gensim training (see the layout sketch below)
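Putting the descriptions above together, the project layout looks roughly like the sketch below (placing stop_words.txt and user_dict.txt under data/ is an assumption based on the paths used in the test code later on):

word2vec/
├── data/
│   ├── source/          # raw corpus files (one or more .txt files)
│   ├── segment/         # word-segmented output, one file per source file
│   ├── stop_words.txt   # stopword list (assumed location)
│   └── user_dict.txt    # custom jieba dictionary (assumed location)
├── models/              # saved gensim model files
├── segment.py           # Chinese word segmentation script
└── word2vec.py          # gensim training script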
1.2 Project download
Project download: GitHub: https://github.com/PanJinquan/nlp-learning-tutorials/tree/master/word2vec (if you find it useful, please give it a Star)
2. Word segmentation with the jieba Chinese tokenizer
The original text of the novel In the Name of the People (《人民的名義》) is used as the corpus; the corpus can be downloaded here. With the corpus in hand, the first step is word segmentation, which is done here with jieba.
2.1 Adding a custom dictionary
- Developers can specify their own custom dictionary so that words missing from the built-in jieba dictionary are covered. jieba can recognize new words on its own, but adding them explicitly guarantees higher accuracy.
- Usage: jieba.load_userdict(file_name)  # file_name is a file-like object or the path to the custom dictionary (a loading sketch follows the example dictionary below)
- The dictionary format is the same as dict.txt: one word per line, and each line has three space-separated parts in this fixed order: the word, its frequency (optional), and its part-of-speech tag (optional). If file_name is a path or a file opened in binary mode, the file must be UTF-8 encoded.
- If the frequency is omitted, an automatically computed value is used that still guarantees the word can be segmented out.
For example:
沙瑞金 5
田國富 5
高育良 5
侯亮平 5
鍾小艾 5
陳岩石 5
歐陽菁 5
易學習 5
王大路 5
蔡成功 5
孫連城 5
季昌明 5
丁義珍 5
鄭西坡 5
趙東來 5
高小琴 5
趙瑞龍 5
林華華 5
陸亦可 5
劉新建 5
劉慶祝 5
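A minimal sketch of loading this dictionary and checking its effect (the file name data/user_dict.txt is taken from the test code further below; the sample sentence is only illustrative):

import jieba

jieba.load_userdict('data/user_dict.txt')  # one entry per line: word [frequency] [POS tag]
print('/'.join(jieba.cut('沙瑞金讚歎易學習的胸懷')))
# with the entries above, names such as 沙瑞金 and 易學習 should be kept as single tokens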
2.2 Adding stopwords
def getStopwords(path):
    '''
    Load the stopword list (one stopword per line).
    :param path: path to the stopword file
    :return: list of stopwords
    '''
    stopwords = []
    with open(path, "r", encoding='utf8') as f:
        lines = f.readlines()
        for line in lines:
            stopwords.append(line.strip())
    return stopwords
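The stopword file is assumed here to contain one stopword per line (for example punctuation and function words such as 的 or 了); the exact content of data/stop_words.txt is up to you. A quick check of the loader:

stopwords = getStopwords('data/stop_words.txt')  # path taken from the test code below
print(len(stopwords), stopwords[:10])            # number of stopwords and the first few entries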
2.3 Chinese word segmentation with jieba
def segment_line(file_list, segment_out_dir, stopwords=[]):
    '''
    Word segmentation, line by line: segment every line of each input file.
    :param file_list: list of input corpus file paths
    :param segment_out_dir: output directory for the segmented files
    :param stopwords: list of stopwords to filter out
    :return:
    '''
    for i, file in enumerate(file_list):
        segment_out_name = os.path.join(segment_out_dir, 'segment_{}.txt'.format(i))
        # mode 'a' appends, so re-running the script adds to any existing output file
        with open(segment_out_name, 'a', encoding='utf8') as segment_file, \
                open(file, encoding='utf8') as f:
            for sentence in f.readlines():
                # jieba.cut(): the sentence argument must be a str (unicode)
                sentence = list(jieba.cut(sentence))
                sentence_segment = []
                for word in sentence:
                    if word not in stopwords:
                        sentence_segment.append(word)
                segment_file.write(" ".join(sentence_segment))
def segment_lines(file_list, segment_out_dir, stopwords=[]):
    '''
    Word segmentation, whole file: segment the entire content of each input file.
    :param file_list: list of input corpus file paths
    :param segment_out_dir: output directory for the segmented files
    :param stopwords: list of stopwords to filter out
    :return:
    '''
    for i, file in enumerate(file_list):
        segment_out_name = os.path.join(segment_out_dir, 'segment_{}.txt'.format(i))
        with open(file, 'rb') as f:
            document = f.read()
            # document_decode = document.decode('GBK')
            document_cut = jieba.cut(document)
            sentence_segment = []
            for word in document_cut:
                if word not in stopwords:
                    sentence_segment.append(word)
            result = ' '.join(sentence_segment)
            result = result.encode('utf-8')
            with open(segment_out_name, 'wb') as f2:
                f2.write(result)
2.4 Complete code and how to run it
# -*-coding: utf-8 -*-
"""
@Project: nlp-learning-tutorials
@File : segment.py
@Author : panjq
@E-mail : [email protected]
@Date : 2017-05-11 17:51:53
"""
##
import jieba
import os
from utils import files_processing
'''
read()      reads the whole file at once and returns its content as a single str.
readline()  reads one line at a time and returns it as a str.
readlines() reads the whole file line by line and returns the lines as a list.
'''
def getStopwords(path):
    '''
    Load the stopword list (one stopword per line).
    :param path: path to the stopword file
    :return: list of stopwords
    '''
    stopwords = []
    with open(path, "r", encoding='utf8') as f:
        lines = f.readlines()
        for line in lines:
            stopwords.append(line.strip())
    return stopwords
def segment_line(file_list, segment_out_dir, stopwords=[]):
    '''
    Word segmentation, line by line: segment every line of each input file.
    :param file_list: list of input corpus file paths
    :param segment_out_dir: output directory for the segmented files
    :param stopwords: list of stopwords to filter out
    :return:
    '''
    for i, file in enumerate(file_list):
        segment_out_name = os.path.join(segment_out_dir, 'segment_{}.txt'.format(i))
        # mode 'a' appends, so re-running the script adds to any existing output file
        with open(segment_out_name, 'a', encoding='utf8') as segment_file, \
                open(file, encoding='utf8') as f:
            for sentence in f.readlines():
                # jieba.cut(): the sentence argument must be a str (unicode)
                sentence = list(jieba.cut(sentence))
                sentence_segment = []
                for word in sentence:
                    if word not in stopwords:
                        sentence_segment.append(word)
                segment_file.write(" ".join(sentence_segment))
def segment_lines(file_list, segment_out_dir, stopwords=[]):
    '''
    Word segmentation, whole file: segment the entire content of each input file.
    :param file_list: list of input corpus file paths
    :param segment_out_dir: output directory for the segmented files
    :param stopwords: list of stopwords to filter out
    :return:
    '''
    for i, file in enumerate(file_list):
        segment_out_name = os.path.join(segment_out_dir, 'segment_{}.txt'.format(i))
        with open(file, 'rb') as f:
            document = f.read()
            # document_decode = document.decode('GBK')
            document_cut = jieba.cut(document)
            sentence_segment = []
            for word in document_cut:
                if word not in stopwords:
                    sentence_segment.append(word)
            result = ' '.join(sentence_segment)
            result = result.encode('utf-8')
            with open(segment_out_name, 'wb') as f2:
                f2.write(result)
if __name__ == '__main__':
    # parallel segmentation (optional)
    # jieba.enable_parallel()

    # load the custom user dictionary
    user_path = 'data/user_dict.txt'
    jieba.load_userdict(user_path)

    # load the stopword list
    stopwords_path = 'data/stop_words.txt'
    stopwords = getStopwords(stopwords_path)

    # segment every corpus file under data/source and write the results to data/segment
    file_dir = 'data/source'
    segment_out_dir = 'data/segment'
    file_list = files_processing.get_files_list(file_dir, postfix='*.txt')
    segment_lines(file_list, segment_out_dir, stopwords)
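The script imports files_processing from the repository's utils package, which is not listed here. A minimal stand-in, assuming get_files_list simply collects the files under file_dir whose names match the given postfix pattern (the real helper in the repository may behave differently):

import glob
import os

def get_files_list(file_dir, postfix='*.txt'):
    # return all files under file_dir matching the postfix glob pattern, in a stable order
    return sorted(glob.glob(os.path.join(file_dir, postfix)))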
3. Training the model with gensim
# -*-coding: utf-8 -*-
"""
@Project: nlp-learning-tutorials
@File : word2vec_gensim.py
@Author : panjq
@E-mail : [email protected]
@Date : 2017-05-11 17:04:35
"""
from gensim.models import word2vec
import multiprocessing
def train_wordVectors(sentences, embedding_size=128, window=5, min_count=5):
    '''
    :param sentences: a corpus stream returned by LineSentence or PathLineSentences, or simply an
                      iterable of token lists, e.g. [['我','是','中國','人'], ['我','的','家鄉','在','廣東']]
    :param embedding_size: dimensionality of the word vectors
    :param window: context window size
    :param min_count: ignores all words with a total frequency lower than this
    :return: w2vModel
    '''
    # note: this targets gensim 3.x; in gensim 4.0+ the `size` parameter was renamed to `vector_size`
    w2vModel = word2vec.Word2Vec(sentences, size=embedding_size, window=window, min_count=min_count,
                                 workers=multiprocessing.cpu_count())
    return w2vModel
def save_wordVectors(w2vModel, word2vec_path):
    w2vModel.save(word2vec_path)

def load_wordVectors(word2vec_path):
    w2vModel = word2vec.Word2Vec.load(word2vec_path)
    return w2vModel
if __name__ == '__main__':
    # [1] With a single file, read it with LineSentence
    # segment_path = './data/segment/segment_0.txt'
    # sentences = word2vec.LineSentence(segment_path)

    # [2] With multiple files, read the whole directory with PathLineSentences
    segment_dir = './data/segment'
    sentences = word2vec.PathLineSentences(segment_dir)

    # quick training run
    model = word2vec.Word2Vec(sentences, hs=1, min_count=1, window=3, size=100)
    print(model.wv.similarity('沙瑞金', '高育良'))
    # print(model.wv.similarity('李達康'.encode('utf-8'), '王大路'.encode('utf-8')))

    # for regular training, setting the following few parameters is usually enough:
    word2vec_path = './models/word2Vec.model'
    model2 = train_wordVectors(sentences, embedding_size=128, window=5, min_count=5)
    save_wordVectors(model2, word2vec_path)
    model2 = load_wordVectors(word2vec_path)
    print(model2.wv.similarity('沙瑞金', '高育良'))
Run output (the two similarity values for 沙瑞金 and 高育良, printed by the quick run and by model2, respectively):
0.968616
0.994922
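The trained model can be queried for more than a pairwise similarity. A minimal sketch that loads the saved model and looks up nearest neighbours (the query word is just one of the names from the custom dictionary above):

from gensim.models import word2vec

model = word2vec.Word2Vec.load('./models/word2Vec.model')
# nearest neighbours of a word in the embedding space
print(model.wv.most_similar('沙瑞金', topn=5))
# pairwise similarity, as in the training script
print(model.wv.similarity('沙瑞金', '高育良'))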