A first word2vec implementation in Python
First, a declaration: this post is a repost. The original is at: http://blog.csdn.net/xiaoquantouer/article/details/53583980
=====================================================================================================
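1. Preface
At first the word2vec setup looked fairly complicated: I spent ages installing Cygwin without really understanding it. Then it struck me that there was no reason to install the C version when I could use a Python one, and that is how I found gensim. Installing the gensim package is all it takes to use word2vec. Note that gensim implements both word2vec architectures, CBOW and skip-gram, selected with the sg parameter. CBOW (sg=0) is the default, and it is what the training log further down reports. Implementations in other languages are only worth studying if you need something gensim does not offer.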
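2. Preparing the corpus
With gensim installed, I found that many online tutorials simply pass in a txt file, but few blogs explain what that file looks like or what format it uses, and none offer a sample file to download. After digging further I realized that the txt file is simply a very large body of text that has already been word-segmented. My own corpus is an example: I took 7,000 news articles that I had crawled earlier and segmented them. Note that words must be separated by spaces.
[Figure: screenshot of the segmented corpus file]
Word segmentation here is done with jieba (結巴分詞).
The code for this part is as follows: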
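    import io
    import jieba

    # Read the raw text and append the segmented result to a second file.
    # Assumption: both files are UTF-8 encoded.
    with io.open("fenci.txt", encoding="utf-8") as f1, \
            io.open("fenci_result.txt", "a", encoding="utf-8") as f2:
        for line in f1:
            # str.replace returns a new string, so the result must be assigned back
            line = line.replace(u"\t", u"").replace(u"\n", u"").replace(u" ", u"")
            if not line:
                continue  # skip empty lines
            seg_list = jieba.cut(line, cut_all=False)  # accurate mode
            # Separate words with spaces; the newline keeps adjacent lines from fusing
            f2.write(u" ".join(seg_list) + u"\n")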
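To make the file format concrete: the segmented file is just one long stream of words separated by whitespace. The sketch below uses plain Python (no gensim or jieba needed) to read such a stream and cut it into fixed-size chunks, which is roughly how the Text8Corpus loader used in the next section treats the file. The helper name chunk_tokens is hypothetical, and the chunk size of 10,000 words is an assumption. As far as I know it is gensim's default, and it is consistent with the training log below, which reports 357 sentences for 3,561,156 words.

```python
import math


def chunk_tokens(text, max_len=10000):
    """Split whitespace-separated text into lists of at most max_len tokens.

    Hypothetical helper for illustration; not part of gensim.
    """
    tokens = text.split()  # any run of spaces or newlines counts as one separator
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


# A tiny stand-in for fenci_result.txt: ten words separated by spaces and a newline.
sample = u"國家 衛生 部門 發布 控煙 條例\n公共 場所 全面 禁菸"
chunks = chunk_tokens(sample, max_len=4)
print(len(chunks))      # number of pseudo-sentences
print(len(chunks[-1]))  # the last chunk holds the leftover tokens

# Same arithmetic for the corpus in this post: 3,561,156 words, 10,000 per chunk.
expected_sentences = int(math.ceil(3561156 / 10000.0))
print(expected_sentences)
```

Because the loader only looks at whitespace, it does not matter whether the segmented words are separated by spaces or by newlines, as long as no two words are glued together.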
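Another point to note is that the corpus must contain a lot of text; the corpora you find online are typically several gigabytes. At first I used a single news article as the corpus and the results were poor: every output was 0. I then switched to 7,000 news articles. The segmented fenci_result.txt is 20 MB, which is still small but already enough for a first result.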
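One likely reason a single article gives nothing useful: word2vec implementations usually discard rare words before training. The log in the next section shows gensim running with min_count=5, which kept only 29,982 of 99,865 word types. The sketch below imitates that pruning step with collections.Counter. The helper prune_vocab is hypothetical, not a gensim function, and the toy counts are made up. In gensim, querying a pruned word raises KeyError, which the script in the next section turns into a similarity of 0.

```python
from collections import Counter


def prune_vocab(tokens, min_count=5):
    """Return the set of words that occur at least min_count times.

    Hypothetical helper that mimics the vocabulary pruning step.
    """
    counts = Counter(tokens)
    return {word for word, n in counts.items() if n >= min_count}


# Toy corpus: one frequent word, one borderline word, two rare words.
tokens = [u"控煙"] * 6 + [u"立法"] * 5 + [u"國家"] * 2 + [u"國務院"]
vocab = prune_vocab(tokens, min_count=5)

print(len(vocab))        # how many word types survive
print(u"國家" in vocab)  # rare words are dropped, so a lookup would fail

# Share of word types kept in this post's own log: 29,982 of 99,865.
kept_percent = round(100.0 * 29982 / 99865)
print(kept_percent)
```

With only one article, almost every word falls below the threshold, so there is very little left to train on or to query.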
3. Training the model with gensim's word2vec
The relevant code is as follows:
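    # -*- coding: utf-8 -*-
    # Written for Python 2.7 and the gensim release of late 2016.
    # In gensim 4.x, size= became vector_size= and the query methods live on model.wv.
    from __future__ import print_function

    import logging

    from gensim.models import word2vec

    # Main program
    logging.basicConfig(format='%(asctime)s:%(levelname)s: %(message)s', level=logging.INFO)
    sentences = word2vec.Text8Corpus(u"fenci_result.txt")  # load the corpus
    # Train the model; window defaults to 5.
    # The default sg=0 trains CBOW; pass sg=1 for skip-gram.
    model = word2vec.Word2Vec(sentences, size=200)
    print(model)

    # Similarity between two words: 國家 (country) and 國務院 (State Council)
    try:
        y1 = model.similarity(u"國家", u"國務院")
    except KeyError:
        y1 = 0  # at least one of the words is not in the vocabulary
    print(u"【國家】和【國務院】的相似度為:", y1)
    print("-----\n")

    # Words most related to a given word: 控煙 (tobacco control)
    y2 = model.most_similar(u"控煙", topn=20)  # the 20 most related
    print(u"和【控煙】最相關的詞有:\n")
    for item in y2:
        print(item[0], item[1])
    print("-----\n")

    # Analogy: 書 (book) is to 不錯 (good) as 質量 (quality) is to ?
    print(u"書-不錯,質量-")
    y3 = model.most_similar([u'質量', u'不錯'], [u'書'], topn=3)
    for item in y3:
        print(item[0], item[1])
    print("----\n")

    # Find the word that does not fit with the others
    y4 = model.doesnt_match(u"書 書籍 教材 很".split())
    print(u"不合群的詞:", y4)
    print("-----\n")

    # Save the model for reuse
    model.save(u"書評.model")
    # Corresponding load call
    # model_2 = word2vec.Word2Vec.load(u"書評.model")

    # Save the word vectors in a format the C tool can parse
    # model.save_word2vec_format(u"書評.model.bin", binary=True)
    # Corresponding load call
    # model_3 = word2vec.Word2Vec.load_word2vec_format(u"書評.model.bin", binary=True)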
The output is as follows:
- "D:\program files\python2.7.0\python.exe" "D:/pycharm workspace/畢設/cluster_test/word2vec.py"
- D:\program files\python2.7.0\lib\site-packages\gensim\utils.py:840: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
- warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
- D:\program files\python2.7.0\lib\site-packages\gensim\utils.py:1015: UserWarning: Pattern library is not installed, lemmatization won't be available.
- warnings.warn("Pattern library is not installed, lemmatization won't be available.")
- 2016-12-12 15:37:43,331: INFO: collecting all words and their counts
- 2016-12-12 15:37:43,332: INFO: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
- 2016-12-12 15:37:45,236: INFO: collected 99865 word types from a corpus of 3561156 raw words and 357 sentences
- 2016-12-12 15:37:45,236: INFO: Loading a fresh vocabulary
- 2016-12-12 15:37:45,413: INFO: min_count=5 retains 29982 unique words (30% of original 99865, drops 69883)
- 2016-12-12 15:37:45,413: INFO: min_count=5 leaves 3444018 word corpus (96% of original 3561156, drops 117138)
- 2016-12-12 15:37:45,602: INFO: deleting the raw counts dictionary of 99865 items
- 2016-12-12 15:37:45,615: INFO: sample=0.001 downsamples 29 most-common words
- 2016-12-12 15:37:45,615: INFO: downsampling leaves estimated 2804247 word corpus (81.4% of prior 3444018)
- 2016-12-12 15:37:45,615: INFO: estimated required memory for 29982 words and 200 dimensions: 62962200 bytes
- 2016-12-12 15:37:45,746: INFO: resetting layer weights
- 2016-12-12 15:37:46,782: INFO: training model with 3 workers on 29982 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
- 2016-12-12 15:37:46,782: INFO: expecting 357 sentences, matching count from corpus used for vocabulary survey
- 2016-12-12 15:37:47,818: INFO: PROGRESS: at 1.96% examples, 267531 words/s, in_qsize 6, out_qsize 0
- 2016-12-12 15:37:48,844: INFO: PROGRESS: at 3.70% examples, 254229 words/s, in_qsize 3, out_qsize 1
- 2016-12-12 15:37:49,871: INFO: PROGRESS: at 5.99% examples, 273509 words/s, in_qsize 3, out_qsize 1
- 2016-12-12 15:37:50,867: INFO: PROGRESS: at 8.18% examples, 281557 words/s, in_qsize 6, out_qsize 0
- 2016-12-12 15:37:51,872: INFO: PROGRESS: at 10.20% examples, 280918 words/s, in_qsize 5, out_qsize 0
- 2016-12-12 15:37:52,898: INFO: PROGRESS: at 12.44% examples, 284750 words/s, in_qsize 6, out_qsize 0
- 2016-12-12 15:37:53,911: INFO: PROGRESS: at 14.17% examples, 278948 words/s, in_qsize 0, out_qsize 0
- 2016-12-12 15:37:54,956: INFO: PROGRESS: at 16.47% examples, 284101 words/s, in_qsize 2, out_qsize 1
- 2016-12-12 15:37:55,934: INFO: PROGRESS: at 18.60% examples, 285781 words/s, in_qsize 6, out_qsize 1
- 2016-12-12 15:37:56,933: INFO: PROGRESS: at 20.84% examples, 288045 words/s, in_qsize 6, out_qsize 0
- 2016-12-12 15:37:57,973: INFO: PROGRESS: at 23.03% examples, 289083 words/s, in_qsize 6, out_qsize 2
- 2016-12-12 15:37:58,993: INFO: PROGRESS: at 24.87% examples, 285990 words/s, in_qsize 6, out_qsize 1
- 2016-12-12 15:38:00,006: INFO: PROGRESS: at 27.17% examples, 288266 words/s, in_qsize 4, out_qsize 1
- 2016-12-12 15:38:01,081: INFO: PROGRESS: at 29.52% examples, 290197 words/s, in_qsize 1, out_qsize 2
- 2016-12-12 15:38:02,065: INFO: PROGRESS: at 31.88% examples, 292344 words/s, in_qsize 6, out_qsize 0
- 2016-12-12 15:38:03,188: INFO: PROGRESS: at 34.01% examples, 291356 words/s, in_qsize 2, out_qsize 2
- 2016-12-12 15:38:04,161: INFO: PROGRESS: at 36.02% examples, 290805 words/s, in_qsize 6, out_qsize 0
- 2016-12-12 15:38:05,174: INFO: PROGRESS: at 38.26% examples, 292174 words/s, in_qsize 3, out_qsize 0
- 2016-12-12 15:38:06,214: INFO: PROGRESS: at 40.56% examples, 293297 words/s, in_qsize 4, out_qsize 1
- 2016-12-12 15:38:07,201: INFO: PROGRESS: at 42.69% examples, 293428 words/s, in_qsize 4, out_qsize 1
- 2016-12-12 15:38:08,266: INFO: PROGRESS: at 44.65% examples, 292108 words/s, in_qsize 1, out_qsize 1
- 2016-12-12 15:38:09,295: INFO: PROGRESS: at 46.83% examples, 292097 words/s, in_qsize 4, out_qsize 1
- 2016-12-12 15:38:10,315: INFO: PROGRESS: at 49.13% examples, 292968 words/s, in_qsize 2, out_qsize 2
- 2016-12-12 15:38:11,326: INFO: PROGRESS: at 51.37% examples, 293621 words/s, in_qsize 5, out_qsize 0
- 2016-12-12 15:38:12,367: INFO: PROGRESS: at 53.39% examples, 292777 words/s, in_qsize 2, out_qsize 2
- 2016-12-12 15:38:13,348: INFO: PROGRESS: at 55.35% examples, 292187 words/s, in_qsize 5, out_qsize 0
- 2016-12-12 15:38:14,349: INFO: PROGRESS: at 57.31% examples, 291656 words/s, in_qsize 6, out_qsize 0
- 2016-12-12 15:38:15,374: INFO: PROGRESS: at 59.50% examples, 292019 words/s, in_qsize 6, out_qsize 0
- 2016-12-12 15:38:16,403: INFO: PROGRESS: at 61.68% examples, 292318 words/s, in_qsize 4, out_qsize 2
- 2016-12-12 15:38:17,401: INFO: PROGRESS: at 63.81% examples, 292275 words/s, in_qsize 6, out_qsize 0
- 2016-12-12 15:38:18,410: INFO: PROGRESS: at 65.71% examples, 291495 words/s, in_qsize 4, out_qsize 1
- 2016-12-12 15:38:19,433: INFO: PROGRESS: at 67.62% examples, 290443 words/s, in_qsize 6, out_qsize 0
- 2016-12-12 15:38:20,473: INFO: PROGRESS: at 69.58% examples, 289655 words/s, in_qsize 6, out_qsize 2
- 2016-12-12 15:38:21,589: INFO: PROGRESS: at 71.71% examples, 289388 words/s, in_qsize 2, out_qsize 2
- 2016-12-12 15:38:22,533: INFO: PROGRESS: at 73.78% examples, 289366 words/s, in_qsize 0, out_qsize 1
- 2016-12-12 15:38:23,611: INFO: PROGRESS: at 75.46% examples, 287542 words/s, in_qsize 5, out_qsize 1
- 2016-12-12 15:38:24,614: INFO: PROGRESS: at 77.25% examples, 286609 words/s, in_qsize 3, out_qsize 0
- 2016-12-12 15:38:25,609: INFO: PROGRESS: at 79.33% examples, 286732 words/s, in_qsize 5, out_qsize 1
- 2016-12-12 15:38:26,621: INFO: PROGRESS: at 81.40% examples, 286595 words/s, in_qsize 2, out_qsize 0
- 2016-12-12 15:38:27,625: INFO: PROGRESS: at 83.53% examples, 286807 words/s, in_qsize 6, out_qsize 0
- 2016-12-12 15:38:28,683: INFO: PROGRESS: at 85.32% examples, 285651 words/s, in_qsize 5, out_qsize 3
- 2016-12-12 15:38:29,729: INFO: PROGRESS: at 87.56% examples, 286175 words/s, in_qsize 6, out_qsize 1
- 2016-12-12 15:38:30,706: INFO: PROGRESS: at 89.86% examples, 286920 words/s, in_qsize 5, out_qsize 0
- 2016-12-12 15:38:31,714: INFO: PROGRESS: at 92.10% examples, 287368 words/s, in_qsize 6, out_qsize 0
- 2016-12-12 15:38:32,756: INFO: PROGRESS: at 94.40% examples, 288070 words/s, in_qsize 4, out_qsize 2
- 2016-12-12 15:38:33,755: INFO: PROGRESS: at 96.30% examples, 287543 words/s, in_qsize 1, out_qsize 0
- 2016-12-12 15:38:34,802: INFO: PROGRESS: at 98.71% examples, 288375 words/s, in_qsize 4, out_qsize 0
- 2016-12-12 15:38:35,286: INFO: worker thread finished; awaiting finish of 2 more threads
- 2016-12-12 15:38:35,286: INFO: worker thread finished; awaiting finish of 1 more threads
- Word2Vec(vocab=29982, size=200, alpha=0.025)
- 【國家】和【國務院】的相似度為: 0.387535493256
- -----
- 2016-12-12 15:38:35,293: INFO: worker thread finished; awaiting finish of 0 more threads
- 2016-12-12 15:38:35,293: INFO: training on 17805780 raw words (14021191 effective words) took 48.5s, 289037 effective words/s
- 2016-12-12 15:38:35,293: INFO: precomputing L2-norms of word weight vectors
- 和【控煙】最相關的詞有:
- 禁菸 0.6038454175
- 防煙 0.585186183453
- 執行 0.530897378922
- 煙控 0.516572892666
- 廣而告之 0.508533298969
- 履約 0.507428050041
- 執法 0.494115233421
- 禁菸令 0.471616715193
- 修法 0.465247869492
- 該項 0.457907706499
- 落實 0.457776963711
- 控制 0.455987215042
- 這方面 0.450040221214
- 立法 0.44820779562
- 控煙辦 0.436062157154
- 執行力 0.432559013367
- 控煙會 0.430508673191
- 進展 0.430286765099
- 監管 0.429748386145
- 懲罰 0.429243773222
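The scores in the list above are cosine similarities between word vectors, and the analogy and odd-one-out queries in the script build on the same measure. The sketch below reproduces the idea in plain Python with made-up 3-dimensional vectors. The vectors, the helper names (cosine, unit, most_similar, odd_one_out) and the simplified ranking are assumptions for illustration, not gensim's implementation, which works on the 200-dimensional vectors learned from the corpus.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def unit(v):
    """Scale a vector to length 1."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def most_similar(vectors, positive, negative=(), topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    skip = set(positive) | set(negative)  # never return the query words themselves
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in skip]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


def odd_one_out(vectors, words):
    """Return the word least similar to the mean of the group."""
    units = [unit(vectors[w]) for w in words]
    mean = [sum(col) / len(units) for col in zip(*units)]
    return min(words, key=lambda w: cosine(vectors[w], mean))


# Made-up toy vectors, NOT trained values.
toy = {
    u"控煙": [0.9, 0.1, 0.0],
    u"禁菸": [0.8, 0.2, 0.1],
    u"立法": [0.3, 0.9, 0.1],
    u"書": [0.0, 0.1, 0.9],
}

best_word, best_score = most_similar(toy, [u"控煙"], topn=1)[0]
outlier = odd_one_out(toy, [u"控煙", u"禁菸", u"書"])
print(best_word == u"禁菸", round(best_score, 3))
print(outlier == u"書")
```

Read this way, a score such as 0.60 for 禁菸 means the two vectors point in broadly the same direction, and the analogy query is the same ranking applied to a combined vector (質量 + 不錯 - 書).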