Training word2vec on Chinese Wikipedia under Windows
阿新 · Published: 2019-01-10
The process has four steps: 1) download the Chinese Wikipedia corpus; 2) convert Traditional characters to Simplified with OpenCC; 3) segment the corpus into words; 4) train word vectors with gensim.
1) The Chinese Wikipedia corpus
Download the dump zhwiki-latest-pages-articles.xml.bz2 from https://dumps.wikimedia.org/zhwiki/latest/ and extract the plain article text with process_wiki.py:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import logging
import os.path
import sys
from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))
    # check and process input arguments
    if len(sys.argv) < 3:
        print "Usage: python process_wiki.py infile outfile"
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0
    output = open(outp, 'w')
    # WikiCorpus strips the wiki markup; each text is a list of tokens
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")
    output.close()
    logger.info("Finished. Saved " + str(i) + " articles")
cd into the directory containing process_wiki.py and run:
python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
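The extraction takes a while. As a quick sanity check once it finishes (a minimal sketch, assuming wiki.zh.text is in the current directory), peek at the beginning of the first extracted article:
# -*- coding: utf-8 -*-
import codecs

# print the first 200 characters of the first article; WikiCorpus writes
# one space-separated article per line
with codecs.open('wiki.zh.text', 'r', 'utf-8') as f:
    first_article = f.readline()
print first_article[:200]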
2) Traditional-to-Simplified conversion
Download and unpack OpenCC, copy wiki.zh.text into the opencc-0.4.2 directory, open a cmd window in that directory, and run:
opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini
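If installing the OpenCC command-line build is awkward on Windows, the conversion can also be done from Python. A minimal sketch, assuming the opencc-python-reimplemented package (the package name and the 't2s' config label are assumptions; check the package's documentation):
# -*- coding: utf-8 -*-
# pip install opencc-python-reimplemented  (assumed package)
import codecs
from opencc import OpenCC

cc = OpenCC('t2s')  # Traditional -> Simplified
with codecs.open('wiki.zh.text', 'r', 'utf-8') as fin, \
     codecs.open('wiki.zh.text.jian', 'w', 'utf-8') as fout:
    for line in fin:
        fout.write(cc.convert(line))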
3) Word segmentation with jieba
This step segments the text, strips punctuation, and removes article-structure markers. (For some applications it is better not to strip punctuation from word2vec training data; in sentiment analysis, for instance, punctuation carries useful signal.) The result is a segmented plain-text file with one article per line and words separated by spaces. script_seg.py is as follows:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys, codecs
import jieba.posseg as pseg

reload(sys)
sys.setdefaultencoding('utf-8')

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print "Usage: python script_seg.py infile outfile"
        sys.exit()
    i = 0
    infile, outfile = sys.argv[1:3]
    output = codecs.open(outfile, 'w', 'utf-8')
    with codecs.open(infile, 'r', 'utf-8') as myfile:
        for line in myfile:
            line = line.strip()
            if len(line) < 1:
                continue
            # skip article-structure markers (present in WikiExtractor output)
            if line.startswith('<doc') or line.startswith('</doc'):
                continue
            words = pseg.cut(line)
            for word, flag in words:
                # POS tag 'x' marks punctuation and other non-word tokens
                if flag.startswith('x'):
                    continue
                output.write(word + ' ')
            # the input holds one article per line, so end the output line here
            output.write('\n')
            i = i + 1
            if (i % 1000 == 0):
                print('Finished ' + str(i) + ' articles')
    output.close()
    print('Finished ' + str(i) + ' articles')
Copy wiki.zh.text.jian and script_seg.py into the same directory and run in cmd:
python script_seg.py wiki.zh.text.jian zh.wiki.text
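To see why script_seg.py filters on flag.startswith('x'), run jieba's POS tagger on a short sample; punctuation comes back tagged 'x':
# -*- coding: utf-8 -*-
import jieba.posseg as pseg

# punctuation such as ',' and '。' is tagged 'x' and dropped by script_seg.py
for word, flag in pseg.cut(u"足球,是一项团队运动。"):
    print word, flag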
4) Train the word vectors
Here we use the Python implementation of word2vec provided by gensim. script_train.py is as follows:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os, sys, codecs
import gensim, logging, multiprocessing
reload(sys)
sys.setdefaultencoding('utf-8')
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
if __name__ == '__main__':
    if len(sys.argv) < 3:
        print "Usage: python script_train.py infile outfile"
        sys.exit()
    infile, outfile = sys.argv[1:3]
    # size=200: 200-dimensional vectors (renamed vector_size in gensim 4.0);
    # window=5: context window; min_count=5: drop rare words; sg=0: CBOW
    model = gensim.models.Word2Vec(gensim.models.word2vec.LineSentence(infile),
                                   size=200, window=5, min_count=5, sg=0,
                                   workers=multiprocessing.cpu_count())
    # model.save(outfile)
    model.wv.save_word2vec_format(outfile + '.bin', binary=True)
Run:
python script_train.py zh.wiki.text zh.wiki.model
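Note that the .bin file keeps only the word vectors. If you may want to continue training on more data later, also uncomment the model.save(outfile) line; a full model saved that way can be reloaded for further training, while vectors loaded from the .bin are frozen. A minimal sketch, assuming model.save('zh.wiki.model') was run:
# -*- coding: utf-8 -*-
import gensim

# a full model saved with model.save() supports further training;
# KeyedVectors loaded from the .bin file do not
model = gensim.models.Word2Vec.load('zh.wiki.model')
print model.wv[u'足球'][:5]  # first five dimensions of one word vector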
Load the .bin file to check the quality of the trained vectors:
#encoding=utf-8
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("./zh.wiki.model.bin", binary=True)
result = model.most_similar(u"足球")
for word, score in result:
    print word, score
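Beyond most_similar, KeyedVectors also supports pairwise similarity and analogy queries. Since the corpus was converted to Simplified characters in step 2, query with Simplified forms:
# -*- coding: utf-8 -*-
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("./zh.wiki.model.bin", binary=True)

# cosine similarity between two words
print model.similarity(u"足球", u"篮球")

# analogy query: 国王 - 男人 + 女人 ~= ?
for word, score in model.most_similar(positive=[u"国王", u"女人"],
                                      negative=[u"男人"], topn=3):
    print word, score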