Paper Sharing --> A Summary of the word2vec Papers

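For a long time I had only a hazy understanding of word2vec and of how word embeddings are implemented under the hood in TensorFlow, so I decided to read the two original word2vec papers, "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality". After reading them I still only half understood, so I went through them again alongside some of the better explanations online (Word2Vec Tutorial - The Skip-Gram Model) and an open-source implementation. This post summarizes what I learned.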

The rest of this post introduces word2vec mainly through the skip-gram model.

The word2vec workflow

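word2vec is just a three-layer neural network.
Feed the model a word and train it to predict the words around that word.
Then drop the last layer, keeping only the input_layer and the hidden_layer.
Pick a word from the vocabulary and feed it to the model; the hidden_layer outputs that word's embedding representation.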
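The four steps above can be sketched as a single forward pass. This is a minimal NumPy illustration, not the TensorFlow code used later in this post; the sizes, the random weights `W1` and `W2`, and the helper name `forward` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 7, 5   # illustrative sizes, not taken from the post

# Hypothetical, randomly initialised weights of the two layers.
W1 = rng.normal(size=(vocab_size, embedding_dim))   # input_layer -> hidden_layer
W2 = rng.normal(size=(embedding_dim, vocab_size))   # hidden_layer -> output layer

def forward(word_index):
    """Return (hidden vector, probabilities over the vocabulary) for one input word."""
    x = np.zeros(vocab_size)
    x[word_index] = 1.0                    # one-hot input
    hidden = x @ W1                        # linear hidden layer, no activation
    logits = hidden @ W2
    exp = np.exp(logits - logits.max())    # numerically stable softmax
    return hidden, exp / exp.sum()

hidden, probs = forward(3)
# `probs` is the prediction over surrounding words used during training.
# Once training is done, W2 and the softmax are dropped and `hidden`
# serves as the embedding of word 3.
```

Here `probs` only matters while training; the part that is kept afterwards is the mapping from a word to `hidden`.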

import numpy as np
import tensorflow as tf
corpus_raw = 'He is the king . The king is royal . She is the royal  queen '
# convert to lower case
corpus_raw = corpus_raw.lower()

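The code above is straightforward. Next we need to build (input, output) pairs. The task is: given a word fed to the model, collect the words around it, that is, the n words before it and the n words after it. This n is WINDOW_SIZE in the code, as in the figure below: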

[Figure]

Note: if the word is at the beginning or end of a sentence, the window ignores the positions that fall outside the sentence.
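The boundary rule in the note can be made concrete with a small helper. This is a sketch under assumptions: the name `context_words` is hypothetical, and the helper excludes the centre word by position rather than by comparing strings.

```python
def context_words(sentence, index, window_size):
    """Return the words within `window_size` positions of sentence[index]."""
    start = max(index - window_size, 0)                 # clip at the sentence start
    end = min(index + window_size + 1, len(sentence))   # clip at the sentence end
    # Skip the centre position itself; every other position in range is context.
    return [sentence[j] for j in range(start, end) if j != index]

sentence = ['he', 'is', 'the', 'king']
first = context_words(sentence, 0, 2)    # ['is', 'the']
middle = context_words(sentence, 1, 2)   # ['he', 'the', 'king']
last = context_words(sentence, 3, 2)     # ['is', 'the']
```

For the first and last word the window is simply shorter on one side; nothing is padded.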

We need some simple preprocessing of the text: build a word2int dictionary and an int2word dictionary.

words = []
for word in corpus_raw.split():
    if word != '.': # because we don't want to treat . as a word
        words.append(word)
words = set(words) # so that all duplicate words are removed
word2int = {}
int2word = {}
vocab_size = len(words) # gives the total number of unique words
for i,word in enumerate(words):
    word2int[word] = i
    int2word[i] = word

Let's see what these dictionaries give us:

print(word2int['queen'])
-> 42 (say)
print(int2word[42])
-> 'queen'
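One detail worth knowing: because the vocabulary is built from a Python set of strings, the id assigned to each word can differ from one run to the next. If reproducible ids matter, sorting the vocabulary first is one option. The following is a sketch of that variant, not the code used in this post; the corpus string is the lower-cased example corpus with whitespace normalised.

```python
corpus = 'he is the king . the king is royal . she is the royal queen'

# Sort the unique words so every run assigns the same id to the same word.
vocab = sorted({w for w in corpus.split() if w != '.'})
word2int = {w: i for i, w in enumerate(vocab)}
int2word = {i: w for w, i in word2int.items()}

print(word2int['queen'])   # 3
print(int2word[3])         # queen
```

With sorted ids the seven words map to he=0, is=1, king=2, queen=3, royal=4, she=5, the=6.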

Good, now we can build the training data.

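# `sentences` is not defined in the original snippet; this definition is an
# assumption that matches the description and the printed output below
sentences = [sentence.split() for sentence in corpus_raw.split('.')]
data = []
WINDOW_SIZE = 2
for sentence in sentences:
    for word_index, word in enumerate(sentence):
        # neighbours within WINDOW_SIZE positions, clipped at the sentence boundaries
        for nb_word in sentence[max(word_index - WINDOW_SIZE, 0) : min(word_index + WINDOW_SIZE, len(sentence)) + 1]:
            if nb_word != word:
                data.append([word, nb_word])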

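The code above splits the corpus into sentences and each sentence into words, producing training samples of the form [word, nb_word], where word is the model input and nb_word is one of the words around it.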

Let's print data and take a look:

print(data)
[['he', 'is'],
 ['he', 'the'],
 ['is', 'he'],
 ['is', 'the'],
 ['is', 'king'],
 ['the', 'he'],
 ['the', 'is'],
 ['the', 'king'],
.
.
.
]

We now have training data, but it has to be converted into a form the model can consume. This is where the word2int dictionary from above comes in.

Going one step further, we convert each word into a one-hot vector:

i.e., 
say we have a vocabulary of 3 words : pen, pineapple, apple
where 
word2int['pen'] -> 0 -> [1 0 0]
word2int['pineapple'] -> 1 -> [0 1 0]
word2int['apple'] -> 2 -> [0 0 1]

So why one-hot features? This is explained later.
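One commonly given reason can already be shown numerically: multiplying a one-hot vector by a weight matrix simply selects one row of that matrix, so the first layer acts as a lookup table from word id to vector. The sketch below assumes a hypothetical random weight matrix `W` and an arbitrary embedding size of 4.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 3, 4    # 3 words as in the pen/pineapple/apple example
W = rng.normal(size=(vocab_size, embedding_dim))   # hypothetical weight matrix

one_hot = np.zeros(vocab_size)
one_hot[1] = 1.0                    # 'pineapple' -> [0 1 0]

via_matmul = one_hot @ W            # what a dense layer computes
via_lookup = W[1]                   # plain row lookup
same = np.allclose(via_matmul, via_lookup)   # the two results coincide
```

Row 1 of `W` is therefore the vector the network associates with 'pineapple'.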

# function to convert numbers to one hot vectors
def to_one_hot(data_point_index, vocab_size):
    temp = np.zeros(vocab_size)
    temp[data_point_index] = 1
    return temp
x_train = [] # input word
y_train = [] # output word
for data_word in data:
    x_train.append(to_one_hot(word2int[ data_word[0] ], vocab_size))
    y_train.append(to_one_hot(word2int[ data_word[1] ], vocab_size))
# convert them to numpy arrays
x_train = np.asarray(x_train)
y_train = np.asarray(y_train)
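For larger datasets the per-word loop can be replaced by indexing an identity matrix, which builds all one-hot rows at once. This is an optional variant with a hypothetical helper name `one_hot_batch`; it produces the same kind of array as to_one_hot applied row by row.

```python
import numpy as np

def one_hot_batch(indices, vocab_size):
    """Return a (len(indices), vocab_size) float array with a single 1 per row."""
    # Row i of the identity matrix is the one-hot vector for index i.
    return np.eye(vocab_size)[np.asarray(indices)]

batch = one_hot_batch([0, 2, 1], 3)
# batch is [[1, 0, 0], [0, 0, 1], [0, 1, 0]] as floats
```

With the dictionaries from this post, the index list would be built from word2int, for example one list for the input words and one for the neighbouring words.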

Building the model with TensorFlow

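# making placeholders for x_train and y_train
# (tf.placeholder is TensorFlow 1.x API; TensorFlow 2.x exposes it as
# tf.compat.v1.placeholder, which requires eager execution to be disabled)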
x = tf.placeholder(tf.float32, shape=(None, vocab_size))
y_label = tf.placeholder(tf.float32, shape=(None, vocab_size))
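The placeholders only declare the shapes of the inputs and labels. To make concrete what a model of this kind does with them, here is a self-contained NumPy sketch of training on (input, context) pairs. It is not the TensorFlow model of this post: the softmax cross-entropy loss, plain gradient descent, the learning rate of 0.5, the absence of bias terms, and the toy `pairs` array are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 7, 5    # illustrative sizes

# Toy (input id, context id) pairs standing in for x_train / y_train.
pairs = np.array([[0, 1], [0, 2], [1, 0], [1, 2], [1, 3], [2, 0], [2, 1], [2, 3]])
x = np.eye(vocab_size)[pairs[:, 0]]   # one-hot inputs
y = np.eye(vocab_size)[pairs[:, 1]]   # one-hot labels

W1 = rng.normal(scale=0.1, size=(vocab_size, embedding_dim))
W2 = rng.normal(scale=0.1, size=(embedding_dim, vocab_size))

def loss_and_grads(x, y, W1, W2):
    """Softmax cross-entropy loss and its gradients with respect to W1 and W2."""
    hidden = x @ W1
    logits = hidden @ W2
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.mean(np.sum(y * np.log(probs), axis=1))
    d_logits = (probs - y) / len(x)        # gradient of the mean loss w.r.t. logits
    grad_W2 = hidden.T @ d_logits
    grad_W1 = x.T @ (d_logits @ W2.T)
    return loss, grad_W1, grad_W2

losses = []
for step in range(200):
    loss, grad_W1, grad_W2 = loss_and_grads(x, y, W1, W2)
    losses.append(loss)
    W1 -= 0.5 * grad_W1                    # plain gradient descent step
    W2 -= 0.5 * grad_W2

embeddings = W1                            # row i is the learned vector of word i
```

After training, only `W1` is kept: its rows are the word vectors, which is the "drop the last layer" step from the workflow.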

[Figure]

As the figure above shows, the input is converted into its embedding_representation, and …
