Understanding and Practicing the GloVe Model (2)

 

I. Runtime environment

Ubuntu 16.04 + Python 3.5

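II. Installing gensim

There are two ways to install it (easy_install is deprecated in current setuptools, so the pip route is preferable).

1) Open a terminal and run: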

sudo easy_install --upgrade gensim

2) Open a terminal and run:

pip install gensim

III. Getting the official code from GitHub

https://github.com/stanfordnlp/GloVe

IV. Generating word vectors

1. Open a terminal in the GloVe directory and compile:

make

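Compilation creates a bin folder (named build in recent versions of the repository) containing four files:

The README describes what each of the four files does.

1) vocab_count: counts how often each word occurs in the source corpus (produces vocab.txt)

The file has one "word frequency" pair per line.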
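As a minimal sketch of how such a file could be consumed, the snippet below parses "word frequency" lines into a dictionary. The function name parse_vocab and the sample counts are illustrative assumptions and are not part of the GloVe tools.

```python
def parse_vocab(lines):
    """Parse 'word frequency' lines (the vocab.txt layout) into a dict."""
    vocab = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, freq = line.rsplit(' ', 1)  # the count is the last field
        vocab[word] = int(freq)
    return vocab


# Made-up sample in the same layout as vocab.txt.
sample = ["the 100", "of 60", "and 40"]
vocab = parse_vocab(sample)
print(vocab["of"])
```

With a real file, the same function can be called as parse_vocab(open('vocab.txt', encoding='utf-8')).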

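2) cooccur: counts word-word co-occurrences (produces the binary file cooccurrence.bin)

3) shuffle: shuffles the co-occurrence records (produces the binary file cooccurrence.shuf.bin)

4) glove: trains the GloVe model (produces vectors.txt and vectors.bin)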
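To make the co-occurrence step more concrete, here is a simplified Python sketch of windowed co-occurrence counting. It is not the C implementation shipped with GloVe, which works on word ids and writes binary records. The function name count_cooccurrences, the window size of 2, and the 1/d distance weighting are assumptions chosen for illustration.

```python
from collections import defaultdict


def count_cooccurrences(tokens, window_size=2):
    """Symmetric co-occurrence counts; a pair at distance d contributes 1/d."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        # Look back at up to window_size preceding words.
        for d in range(1, window_size + 1):
            j = i - d
            if j < 0:
                break
            context = tokens[j]
            counts[(word, context)] += 1.0 / d
            counts[(context, word)] += 1.0 / d  # symmetric context
    return dict(counts)


tokens = "the cat sat on the mat".split()
cooc = count_cooccurrences(tokens, window_size=2)
print(cooc[("cat", "the")])
```

Pairs that never fall inside the same window, such as ("mat", "cat") here, simply have no entry, which is why the result is a sparse mapping rather than a full matrix.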

2. Run demo.sh:

sh demo.sh  

The script downloads the default corpus and trains the model.

 

The final result is vectors.txt.
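vectors.txt is a plain-text file with one word per line followed by its vector components, separated by spaces. The following sketch reads that layout into a dictionary. The function name read_vectors and the three-dimensional sample rows are hypothetical and only serve as an illustration.

```python
def read_vectors(lines):
    """Parse 'word v1 v2 ... vn' lines into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Made-up rows in the same layout as vectors.txt.
sample = ["the 0.1 -0.2 0.3", "cat 0.5 0.4 -0.1"]
vectors = read_vectors(sample)
print(len(vectors), len(vectors["cat"]))
```

For the real output, pass an open file instead of the sample list, for example read_vectors(open('vectors.txt', encoding='utf-8')).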

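V. Building a gensim model from the word vectors and loading it

1. Create a file named load_model.py in the same directory with the following code: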

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import shutil

import gensim


def getFileLineNums(filename):
    """Count the lines (one word vector per line) in a GloVe text file."""
    count = 0
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            count += 1
    return count


def getDimensions(filename):
    """Infer the vector size from the first line: a word followed by its values."""
    with open(filename, 'r', encoding='utf-8') as f:
        return len(f.readline().rstrip().split(' ')) - 1


def prepend_line(infile, outfile, line):
    """
    Write `line` to outfile, then copy the contents of infile after it.
    (source: http://stackoverflow.com/a/10850588/610569)
    """
    with open(infile, 'r', encoding='utf-8') as old:
        with open(outfile, 'w', encoding='utf-8') as new:
            new.write(str(line) + "\n")
            shutil.copyfileobj(old, new)


def prepend_slow(infile, outfile, line):
    """
    Slower way to prepend the line: copy the input file line by line.
    """
    with open(infile, 'r', encoding='utf-8') as fin:
        with open(outfile, 'w', encoding='utf-8') as fout:
            fout.write(line + "\n")
            for row in fin:
                fout.write(row)


def load(filename):
    # Input: GloVe model file, e.g. vectors.txt or glove.6B.300d.txt.
    # More models can be downloaded from http://nlp.stanford.edu/projects/glove/
    glove_file = filename

    num_lines = getFileLineNums(glove_file)
    # Read the dimension from the file instead of hard-coding 50,
    # so that 300-dimensional vectors also load correctly.
    dims = getDimensions(glove_file)
    print(num_lines)

    # Output: gensim model in word2vec text format.
    gensim_file = 'glove_model.txt'
    gensim_first_line = "{} {}".format(num_lines, dims)

    # Prepend the "<vocabulary size> <dimensions>" header that gensim expects.
    prepend_line(glove_file, gensim_file, gensim_first_line)

    # Load the newly created glove_model.txt through the gensim API.
    model = gensim.models.KeyedVectors.load_word2vec_format(gensim_file, binary=False)

    model_name = gensim_file[6:-4]  # 'glove_model.txt' -> 'model'
    model.save('./' + model_name)

    return model


if __name__ == '__main__':
    # Build the gensim model from the trained vectors.
    # To use pretrained vectors instead: load('glove.6B.300d.txt')
    load('vectors.txt')

    # load() saves the model as ./model; reload it from disk.
    model_name = 'model'
    model = gensim.models.KeyedVectors.load('./' + model_name)

    # gensim < 4.0; in gensim >= 4.0 use len(model.key_to_index) instead.
    print(len(model.vocab))

    word_list = ['girl', 'dog']

    for word in word_list:
        print(word, '--')
        for i in model.most_similar(word, topn=10):
            print(i[0], i[1])
        print('')

 

Run it from a terminal in that directory:

python load_model.py 

 

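The output is:

Vocabulary size (number of lines): 400000 (this figure matches the 400K-word vocabulary of the pretrained glove.6B vectors)

followed by the 10 words most similar to each word in word_list (girl and dog).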
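The conversion performed by load() amounts to adding one header line, "<number of rows> <dimensions>", in front of the GloVe text file so that it matches the word2vec text format. The sketch below reproduces just that step on a tiny made-up file without needing gensim. The function name add_word2vec_header and the file names are hypothetical, and the whole file is read into memory, which is only reasonable for small inputs.

```python
import os
import tempfile


def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file to out_path with a '<rows> <dims>' header line."""
    with open(glove_path, 'r', encoding='utf-8') as f:
        rows = f.readlines()
    dims = len(rows[0].split()) - 1  # number of fields after the word
    with open(out_path, 'w', encoding='utf-8') as f:
        f.write("{} {}\n".format(len(rows), dims))
        f.writelines(rows)
    return len(rows), dims


# Build a tiny GloVe-style file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
glove_path = os.path.join(tmp_dir, "tiny_glove.txt")
out_path = os.path.join(tmp_dir, "tiny_word2vec.txt")
with open(glove_path, 'w', encoding='utf-8') as f:
    f.write("girl 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")

header = add_word2vec_header(glove_path, out_path)
with open(out_path, encoding='utf-8') as f:
    first_line = f.readline().strip()
print(first_line)
```

Deriving the dimension from the first row avoids a mismatch between the header and the actual vector length, which would otherwise make the file fail to load.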

VI. Testing

The test code is as follows:

import gensim

# load() in load_model.py saves the model as ./model.
model_name = 'model'
model = gensim.models.KeyedVectors.load('./' + model_name)
# gensim < 4.0; in gensim >= 4.0 use len(model.key_to_index) instead.
print(len(model.vocab))
word_list = ['person', 'pet']
for word in word_list:
    print(word, '--')
    for i in model.most_similar(word, topn=10):
        print(i[0], i[1])
        print('')

Results:

person --
someone 0.690635979176

man 0.64434415102

anyone 0.627825558186

woman 0.617089629173

one 0.591174006462

actually 0.579971313477

persons 0.577681422234

people 0.571225821972

else 0.562521100044

somebody 0.560000300407

pet --
pets 0.686407566071

dog 0.629159927368

cat 0.58703649044

dogs 0.545046746731

cats 0.526196360588

animal 0.516855597496

animals 0.507143497467

puppy 0.486273795366

toy 0.430860459805

rabbits 0.420677244663
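The scores above are cosine similarities between word vectors, which is how gensim's most_similar ranks neighbours. As a rough sketch of the idea, the snippet below ranks words in a toy dictionary by cosine similarity. The helper names and the two-dimensional toy vectors are made up for illustration, and this brute-force loop is far slower than gensim's vectorised implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=10):
    """Return up to topn (word, score) pairs sorted by descending similarity."""
    target = vectors[word]
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


# Toy vectors, not taken from any trained model.
toy = {"pet": [1.0, 1.0], "dog": [1.0, 0.8], "car": [-1.0, 0.2]}
for neighbour, score in most_similar(toy, "pet", topn=2):
    print(neighbour, round(score, 3))
```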

Reference: https://blog.csdn.net/sscssz/article/details/53333225