Understanding and Practicing the GloVe Model (2)
阿新 • Published: 2018-11-10
I. Runtime Environment
Ubuntu 16.04 + Python 3.5
II. Installing gensim
There are two ways to install it.
1) Open a terminal and run:
sudo easy_install --upgrade gensim
2) Or open a terminal and run:
pip install gensim
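To confirm the installation, print the installed version (use python3 explicitly if the default python on Ubuntu 16.04 is still Python 2):
python3 -c "import gensim; print(gensim.__version__)"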
III. Getting the Official GitHub Code
https://github.com/stanfordnlp/GloVe
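Clone the repository and enter its directory:
git clone https://github.com/stanfordnlp/GloVe.git
cd GloVe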
IV. Generating Word Vectors
1. Open a terminal in the GloVe directory and compile:
make
Compilation produces a build folder containing four executables.
The README describes all four; a sketch of chaining them by hand follows this list.
1) vocab_count: computes word statistics over the raw corpus (produces vocab.txt).
Each line of vocab.txt has the format "word frequency".
2) cooccur: computes word-word co-occurrence statistics (produces the binary file cooccurrence.bin).
3) shuffle: shuffles the co-occurrence records (produces the binary file cooccurrence.shuf.bin).
4) glove: trains the GloVe model, producing vectors.txt and vectors.bin.
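For reference, here is a minimal sketch of running the four tools by hand on your own corpus instead of using demo.sh. The corpus file name corpus.txt is a placeholder, and the parameter values are only illustrative; check demo.sh for the exact settings it uses.
# 1) count word frequencies, dropping words seen fewer than 5 times
build/vocab_count -min-count 5 -verbose 2 < corpus.txt > vocab.txt
# 2) accumulate word-word co-occurrence counts within a 15-word window
build/cooccur -memory 4.0 -vocab-file vocab.txt -window-size 15 -verbose 2 < corpus.txt > cooccurrence.bin
# 3) shuffle the co-occurrence records before training
build/shuffle -memory 4.0 -verbose 2 < cooccurrence.bin > cooccurrence.shuf.bin
# 4) train 50-dimensional vectors; writes vectors.txt and vectors.bin
build/glove -save-file vectors -input-file cooccurrence.shuf.bin -vocab-file vocab.txt -vector-size 50 -threads 8 -x-max 10 -iter 15 -binary 2 -verbose 2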
2. Run the demo script:
sh demo.sh
It downloads the default corpus and trains the model.
When it finishes, you get vectors.txt.
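Each line of vectors.txt is a word followed by its 50 vector components, separated by spaces. You can peek at the first line from a terminal:
head -n 1 vectors.txt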
V. Converting the Word Vectors into a gensim Model and Loading It
1. In the same directory, create a load_model.py file with the following code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import shutil
import gensim


def getFileLineNums(filename):
    """Count the number of lines (i.e. the vocabulary size) in the vectors file."""
    count = 0
    with open(filename, 'r') as f:
        for line in f:
            count += 1
    return count


def prepend_line(infile, outfile, line):
    """Prepend a line using bash-style file copying on Linux.
    (source: http://stackoverflow.com/a/10850588/610569)
    """
    with open(infile, 'r') as old:
        with open(outfile, 'w') as new:
            new.write(str(line) + "\n")
            shutil.copyfileobj(old, new)


def prepend_slow(infile, outfile, line):
    """Slower way to prepend the line by re-creating the input file."""
    with open(infile, 'r') as fin:
        with open(outfile, 'w') as fout:
            fout.write(line + "\n")
            for l in fin:
                fout.write(l)


def load(filename):
    # Input: GloVe vectors file.
    # More pre-trained models can be downloaded from http://nlp.stanford.edu/projects/glove/
    # e.g. glove_file = "glove.840B.300d.txt"
    glove_file = filename
    num_lines = getFileLineNums(filename)
    dims = 50  # demo.sh trains 50-dimensional vectors
    print(num_lines)

    # Output: gensim model in word2vec text format, which needs a
    # "vocab_size dimensions" header as its first line.
    gensim_file = 'glove_model.txt'
    gensim_first_line = "{} {}".format(num_lines, dims)
    prepend_line(glove_file, gensim_file, gensim_first_line)
    # On non-Linux platforms, use prepend_slow() instead.

    # Load the newly created glove_model.txt with the gensim API and save it.
    model = gensim.models.KeyedVectors.load_word2vec_format(gensim_file, binary=False)
    model_name = gensim_file[6:-4]  # -> 'model'
    model.save('./' + model_name)
    return model


if __name__ == '__main__':
    model = load('vectors.txt')  # convert vectors.txt and save the gensim model
    print(len(model.vocab))
    word_list = [u'girl', u'dog']
    for word in word_list:
        print(word, '--')
        for i in model.most_similar(word, topn=10):
            print(i[0], i[1])
        print('')
Then run it from a terminal in that directory:
python load_model.py
The output is:
the vocabulary line count: 400000
and the top-10 most similar words to each word in word_list = [u'girl', u'dog'].
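The conversion in load_model.py only prepends a "vocab_size dimensions" header line (required by the word2vec text format) to the GloVe vectors file. Assuming the vocabulary count reported above, the first line of the generated glove_model.txt should therefore read 400000 50, which you can check with:
head -n 1 glove_model.txt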
VI. Testing
The test code is as follows:
import gensim

# Load the gensim model saved by load_model.py.
model_name = 'model'
model = gensim.models.KeyedVectors.load('./' + model_name)
print(len(model.vocab))
word_list = [u'person', u'pet']
for word in word_list:
    print(word, '--')
    for i in model.most_similar(word, topn=10):
        print(i[0], i[1])
    print('')
Results:
person --
someone 0.690635979176
man 0.64434415102
anyone 0.627825558186
woman 0.617089629173
one 0.591174006462
actually 0.579971313477
persons 0.577681422234
people 0.571225821972
else 0.562521100044
somebody 0.560000300407
pet --
pets 0.686407566071
dog 0.629159927368
cat 0.58703649044
dogs 0.545046746731
cats 0.526196360588
animal 0.516855597496
animals 0.507143497467
puppy 0.486273795366
toy 0.430860459805
rabbits 0.420677244663