word2vec Python interface: installation and usage
https://github.com/danielfrg/word2vec
Installation
I recommend the Anaconda Python distribution.
pip install word2vec
Wheel: Wheel packages for OS X and Windows are provided on PyPI on a best-effort basis. The code is quite easy to compile, so consider using --no-use-wheel on Linux and OS X (see the example after these notes).
Linux: There is no wheel support for Linux, so you have to compile the C code. The only requirement is gcc:
CFLAGS='-march=corei7' pip install word2vec
Windows: Very experimental support based on this win32 port.
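For example, to force a source build instead of the wheel (this assumes an older pip that still accepts the --no-use-wheel flag; newer releases replaced it with --no-binary):
pip install --no-use-wheel word2vec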
%load_ext autoreload
%autoreload 2
word2vec
This notebook is equivalent to demo-word.sh, demo-analogy.sh, demo-phrases.sh and demo-classes.sh from Google.
Training
Download some data, for example: http://mattmahoney.net/dc/text8.zip
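If you prefer to fetch and unpack the corpus from Python instead of the browser, a minimal Python 3 sketch (adjust the target directory to your machine; the paths in the cells below use the author's Downloads folder):

# Fetch and unpack the text8 corpus used in this walkthrough.
import os
import urllib.request
import zipfile

url = 'http://mattmahoney.net/dc/text8.zip'
target_dir = os.path.expanduser('~/Downloads')
zip_path = os.path.join(target_dir, 'text8.zip')

urllib.request.urlretrieve(url, zip_path)   # ~31 MB download
with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(target_dir)               # produces a single plain-text file named 'text8'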
In [2]: import word2vec
Run word2phrase to group up similar words: "Los Angeles" becomes "Los_Angeles".
word2vec.word2phrase('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-phrases', verbose=True)

[u'word2phrase', u'-train', u'/Users/drodriguez/Downloads/text8', u'-output', u'/Users/drodriguez/Downloads/text8-phrases', u'-min-count', u'5', u'-threshold', u'100', u'-debug', u'2']
Starting training using file /Users/drodriguez/Downloads/text8
Words processed: 17000K     Vocab size: 4399K
Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206
This will create a text8-phrases file that we can use as a better input for word2vec. Note that you could easily skip this previous step and use the original data as input for word2vec.
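If you do skip word2phrase, the training call is the same except that it points at the raw corpus (a sketch reusing the paths from above):

# Train directly on the raw text8 file, without the word2phrase preprocessing step.
word2vec.word2vec('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8.bin', size=100, verbose=True)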
Train the model using the word2phrase
output.
word2vec.word2vec('/Users/drodriguez/Downloads/text8-phrases', '/Users/drodriguez/Downloads/text8.bin', size=100, verbose=True)
Starting training using file /Users/drodriguez/Downloads/text8-phrases
Vocab size: 98331
Words in train file: 15857306
Alpha: 0.000002  Progress: 100.03%  Words/thread/sec: 286.52k
That generated a text8.bin file containing the word vectors in a binary format.
Do the clustering of the vectors based on the trained model.
In [5]: word2vec.word2clusters('/Users/drodriguez/Downloads/text8', '/Users/drodriguez/Downloads/text8-clusters.txt', 100, verbose=True)

Starting training using file /Users/drodriguez/Downloads/text8
Vocab size: 71291
Words in train file: 16718843
Alpha: 0.000002  Progress: 100.02%  Words/thread/sec: 287.55k
That created a text8-clusters.txt file with the cluster number for every word in the vocabulary.
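The file is plain text; assuming the usual output of one "word cluster-number" pair per line, the first entries can be inspected directly (a small sketch, not part of the library API):

# Print the first few lines of the clusters file (assumed format: 'word cluster' per line).
with open('/Users/drodriguez/Downloads/text8-clusters.txt') as f:
    for _ in range(5):
        print(f.readline().strip())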
Predictions
In [1]: import word2vec
Import the word2vec binary file created above:
model = word2vec.load('/Users/drodriguez/Downloads/text8.bin')
We can take a look at the vocabulary as a numpy array:
In [3]: model.vocab
Out[3]:
array([u'</s>', u'the', u'of', ..., u'dakotas', u'nias', u'burlesques'], dtype='<U78')
Or take a look at the whole matrix
In [4]: model.vectors.shape
Out[4]:
(98331, 100)

In [5]: model.vectors
Out[5]:
array([[ 0.14333282,  0.15825513, -0.13715845, ...,  0.05456942,  0.10955409,  0.00693387],
       [ 0.1220774 ,  0.04939618,  0.09545057, ..., -0.00804222, -0.05441621, -0.10076696],
       [ 0.16844609,  0.03734054,  0.22085373, ...,  0.05854521,  0.04685341,  0.02546694],
       ...,
       [-0.06760896,  0.03737842,  0.09344187, ...,  0.14559349, -0.11704484, -0.05246212],
       [ 0.02228479, -0.07340827,  0.15247506, ...,  0.01872172, -0.18154132, -0.06813737],
       [ 0.02778879, -0.06457976,  0.07102411, ..., -0.00270281, -0.0471223 , -0.135444  ]])
We can retrieve the vector of individual words:
In [6]: model['dog'].shape
Out[6]:
(100,)

In [7]: model['dog'][:10]
Out[7]:
array([ 0.05753701,  0.0585594 ,  0.11341395,  0.02016246,  0.11514406,  0.01246986,  0.00801256,  0.17529851,  0.02899276,  0.0203866 ])
We can do simple queries to retrieve words similar to "socks" based on cosine similarity:
In [8]: indexes, metrics = model.cosine('socks')
        indexes, metrics
Out[8]:
(array([20002, 28915, 30711, 33874, 27482, 14631, 22992, 24195, 25857, 23705]),
 array([ 0.8375354 ,  0.83590846,  0.82818749,  0.82533614,  0.82278399,  0.81476386,  0.8139092 ,  0.81253798,  0.8105933 ,  0.80850171]))
This returned a tuple with 2 items:
- numpy array with the indexes of the similar words in the vocabulary
- numpy array with cosine similarity to each word
It's possible to get the words for those indexes:
In [9]: model.vocab[indexes]
Out[9]:
array([u'hairy', u'pumpkin', u'gravy', u'nosed', u'plum', u'winged', u'bock', u'petals', u'biscuits', u'striped'], dtype='<U78')
There is a helper function to create a combined response: a numpy record array
In [10]: model.generate_response(indexes, metrics)
Out[10]:
rec.array([(u'hairy', 0.8375353970603848), (u'pumpkin', 0.8359084628493809), (u'gravy', 0.8281874915608026), (u'nosed', 0.8253361379785071), (u'plum', 0.8227839904046932), (u'winged', 0.8147638561412592), (u'bock', 0.8139092031538545), (u'petals', 0.8125379796045767), (u'biscuits', 0.8105933044655644), (u'striped', 0.8085017054444408)], dtype=[(u'word', '<U78'), (u'metric', '<f8')])
It's easy to turn that numpy array into a pure Python response:
In [11]: model.generate_response(indexes, metrics).tolist()
Out[11]:
[(u'hairy', 0.8375353970603848), (u'pumpkin', 0.8359084628493809), (u'gravy', 0.8281874915608026), (u'nosed', 0.8253361379785071), (u'plum', 0.8227839904046932), (u'winged', 0.8147638561412592), (u'bock', 0.8139092031538545), (u'petals', 0.8125379796045767), (u'biscuits', 0.8105933044655644), (u'striped', 0.8085017054444408)]
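If a mapping is handier than a list of pairs, the same tuples can be fed straight into dict (a small illustrative step, not part of the library API):

# Hypothetical convenience: build a word -> similarity mapping from the response above.
similar = dict(model.generate_response(indexes, metrics).tolist())
similar[u'hairy']   # 0.8375...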
Phrases
Since we trained the model with the output of word2phrase, we can ask for the similarity of "phrases":
In [12]: indexes, metrics = model.cosine('los_angeles')
         model.generate_response(indexes, metrics).tolist()
Out[12]:
[(u'san_francisco', 0.886558000570455), (u'san_diego', 0.8731961018831669), (u'seattle', 0.8455603712285231), (u'las_vegas', 0.8407843553947962), (u'miami', 0.8341796009062884), (u'detroit', 0.8235412519780195), (u'cincinnati', 0.8199138493085706), (u'st_louis', 0.8160655356728751), (u'chicago', 0.8156786240847214), (u'california', 0.8154244925085712)]
Analogies
It's possible to do more complex queries, like analogies such as: king - man + woman = queen.
This method returns the same as cosine: the indexes of the words in the vocabulary and the metric.
In [13]: indexes, metrics = model.analogy(pos=['king', 'woman'], neg=['man'], n=10)
         indexes, metrics
Out[13]:
(array([1087, 1145, 7523, 3141, 6768, 1335, 8419, 1826,  648, 1426]),
 array([ 0.2917969 ,  0.27353295,  0.26877692,  0.26596514,  0.26487509,  0.26428581,  0.26315492,  0.26261258,  0.26136635,  0.26099078]))

In [14]: model.generate_response(indexes, metrics).tolist()
Out[14]:
[(u'queen', 0.2917968955611075), (u'prince', 0.27353295205311695), (u'empress', 0.2687769174818083), (u'monarch', 0.2659651399832089), (u'regent', 0.26487508713026797), (u'wife', 0.2642858109968327), (u'aragon', 0.2631549214361766), (u'throne', 0.26261257728511833), (u'emperor', 0.2613663460665488), (u'bishop', 0.26099078142148696)]
Clusters
In [15]: clusters = word2vec.load_clusters('/Users/drodriguez/Downloads/text8-clusters.txt')
We can get the cluster number for individual words:
In [16]: clusters['dog']
Out[16]:
11
We can get all the words grouped in a specific cluster:
In [17]: clusters.get_words_on_cluster(90).shape
Out[17]:
(221,)

In [18]: clusters.get_words_on_cluster(90)[:10]
Out[18]:
array(['along', 'together', 'associated', 'relationship', 'deal', 'combined', 'contact', 'connection', 'bond', 'respect'], dtype=object)
We can add the clusters to the word2vec model and generate a response that includes the clusters
In [19]: model.clusters = clusters

In [20]: indexes, metrics = model.analogy(pos=['paris', 'germany'], neg=['france'], n=10)

In [21]: model.generate_response(indexes, metrics).tolist()
Out[21]:
[(u'berlin', 0.32333651414395953, 20), (u'munich', 0.28851564633559, 20), (u'vienna', 0.2768927258877336, 12), (u'leipzig', 0.2690537010929304, 91), (u'moscow', 0.26531859560322785, 74), (u'st_petersburg', 0.259534503067277, 61), (u'prague', 0.25000637367753303, 72), (u'dresden', 0.2495974800117785, 71), (u'bonn', 0.24403155303236473, 8), (u'frankfurt', 0.24199720792200027, 31)]
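With clusters attached, each response tuple carries a third field with the cluster number, which makes simple post-filtering easy. A hypothetical example, reusing cluster 20 (the cluster of "berlin" and "munich" in the output above):

# Keep only the analogy results that fall into cluster 20.
results = model.generate_response(indexes, metrics).tolist()
same_cluster = [(word, metric) for word, metric, cluster in results if cluster == 20]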