
Stanford 224N- Word Vector Representations: Word2Vec

word vector: a model of word meaning (vectors that can encode word meanings)
Using a taxonomy resource to handle word meaning: WordNet gives taxonomy information about words
available in Python via `from nltk.corpus import wordnet`
synonyms and polysemy are listed, but these taxonomic descriptions miss many nuances and are incomplete, and people use words more flexibly than the taxonomy allows
it is hard to give a precise definition of word similarity, and WordNet can't capture many of these nuances
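As a concrete illustration, a minimal sketch of querying WordNet through NLTK (assuming `nltk` is installed and the WordNet corpus has been downloaded; the word "good" is just an example):

```python
# Minimal WordNet lookup via NLTK (assumes nltk.download('wordnet') has been run).
from nltk.corpus import wordnet as wn

# One word maps to many synsets (senses): polysemy.
for synset in wn.synsets('good'):
    print(synset.name(), '-', synset.definition())

# The lemma names of a synset are the "synonyms" WordNet stores.
print(wn.synsets('good')[0].lemma_names())
```

Browsing the output makes the point above visible: the senses are discrete, and the synonym sets miss many shades of meaning.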

We can build a very long vector and assign each word a 1 in its own dimension: the "one-hot" representation (a localist representation). The dot product of any two different one-hot vectors is 0,
so it gives no inherent notion of relationships between words. We need to include similarity.
What we want is an encoding where you can just read the similarity between words off the dot product!!!
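A minimal sketch of the problem, with a hypothetical three-word vocabulary:

```python
# One-hot ("localist") vectors: every pair of distinct words is orthogonal,
# so the dot product carries no similarity information at all.
import numpy as np

vocab = ["hotel", "motel", "cat"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("hotel") @ one_hot("motel"))  # 0.0, even though the words are related
```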


 

 

Use distributional similarity:
we get a lot of value by looking at the contexts in which a word appears and counting the words that occur together with it
“You shall know a word by the company it keeps”
The numbers in a word vector (a distributed representation, built from distributional similarity) should let it predict the other words in the text around the target word (words predict each other)
It feels like a probability distribution: a center word plus the surrounding context words
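A minimal sketch of the counting intuition, on a hypothetical toy corpus:

```python
# For every word, count the words that appear within a small window around it.
from collections import Counter, defaultdict

corpus = "I like deep learning . I like NLP . I enjoy flying".split()
window = 2

contexts = defaultdict(Counter)
for t, word in enumerate(corpus):
    for j in range(max(0, t - window), min(len(corpus), t + window + 1)):
        if j != t:
            contexts[word][corpus[j]] += 1

print(contexts["like"])  # the company that "like" keeps
```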

What is word2vec?
A general recipe for learning neural word embeddings: predict between a center word and the words that appear in its context.
p(context words | w_t as center) = the probability of predicting the words around the center word
Loss function: J = 1 - p(w_{-t} | w_t), where w_t is the center word and w_{-t} denotes the other words around it
Change the representation of the words to minimize our loss!!
Use deep learning: you set this goal, say nothing else, and leave the rest to the model. The output vectors are powerful!

powerful word vectors!

  1. Having distributed representations of words
  2. and being able to predict the other words in the context!

How does that happen?
word2vec: learn word vectors by predicting between every word and its context words
two algorithms:
1. Skip-gram (SG)
At each estimation step, take one word as the center word, then try to predict the words in its context within some window size.
The model defines a probability distribution: given a center word, the probability that some word appears in its context; we choose the word vector representations so as to maximize that probability.
For a given word we have one, and only one, probability distribution over context words.

for each word t = 1, 2, ..., T, predict the surrounding words in a window of radius m

  1. get a big, long sequence of words to learn from (enough words in context)
  2. go through each position in the text, with a window of size 2m around it (see the sketch after this list)
  3. define a probability distribution over the words in the window
  4. set the parameters of our model (that is, the vector representations of the words) to maximize that probability
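In the lecture's notation, step 4 means maximizing the likelihood of every context word given its center word, J'(θ) = Π_{t=1..T} Π_{-m ≤ j ≤ m, j ≠ 0} p(w_{t+j} | w_t; θ), usually written as minimizing the average negative log likelihood. Below is a minimal sketch of step 2, enumerating (center, context) pairs; the token list is illustrative:

```python
# Enumerate (center, context) training pairs with window radius m.
def skipgram_pairs(tokens, m=2):
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

print(skipgram_pairs("the quick brown fox jumps".split(), m=1))
```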

Lecture 2: around the 27-minute mark

 


see ipad note

One vector representation per word?
Two can be better: one vector v for the word as a center word, and one vector u for the word as a context word.
We don't pay attention to the distance or position of the context word.

 


For each position in the context, multiply the center word's vector by a matrix that stores the representations of the context words.
Take the dot product for similarity (u_o^T v_c, multiplying v_c by every row u_o); u_o^T v_c gives a single score per row, and the stacked results in the diagram are just the same v_c transformed once per context position.
Apply a softmax to predict the most likely word within the context window.
W is the weight matrix formed by all words as center words (the word embedding matrix); W' is the weight matrix formed by all words as context words.
Differences between the prediction and the actual context word give the loss.
W and W' (v and u):
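A minimal sketch of that prediction step with toy matrices (sizes and names are illustrative):

```python
# p(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c): dot products, then softmax.
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                          # toy vocabulary size and embedding dimension
W = rng.normal(size=(V, d))          # rows are center-word vectors v
W_prime = rng.normal(size=(V, d))    # rows are context-word vectors u

c = 2                                # index of the center word
scores = W_prime @ W[c]              # u_w^T v_c for every word w in the vocabulary
probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
print(probs)                         # p(o | c) for every candidate context word o
```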

 

 


Train the model: compute the vector representations.
Stack the u and v vectors of every word into one big parameter vector θ; each word contributes two d-dimensional vectors, so θ has 2dV entries in total.
This parameter vector is the thing we are going to optimize.
Gradient!
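A minimal sketch of one gradient step, reusing toy matrices like the ones above; it relies on the standard softmax gradient d/dv_c log p(o|c) = u_o - Σ_w p(w|c) u_w (observed context vector minus the model's expected context vector). Indices are illustrative.

```python
# One gradient-ascent step on log p(o | c) with respect to the center vector v_c.
import numpy as np

rng = np.random.default_rng(1)
V, d = 5, 3
W = rng.normal(size=(V, d))              # center vectors v
W_prime = rng.normal(size=(V, d))        # context vectors u

c, o = 2, 4                              # center word index, observed context word index
scores = W_prime @ W[c]
probs = np.exp(scores) / np.exp(scores).sum()

grad_vc = W_prime[o] - probs @ W_prime   # u_o minus the expected context vector
W[c] += 0.01 * grad_vc                   # small step in the gradient direction
```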

 

PAPER: A Simple but Tough-to-Beat Baseline for Sentence Embeddings
How do we encode the meaning of sentences (not just words, as above)?
Sentence Embedding:

  1. Compute sentence similarity through the inner (dot) product
  2. Sentence classification tasks, such as sentiment analysis

Compose word representations into sentence representations:

  1. bag-of-words (BoW): the average of the word vectors
  2. CNN, RNN

This paper: a very simple unsupervised method
weighted BoW + removing a special direction
Step 1:
compute word vector representations; each word gets its own weight
weight = a / (a + f(w)), where f(w) is the word's frequency and a is a parameter
Step 2:
compute the first principal component of the sentence vectors and remove each sentence's projection onto it
This combines the sentence embedding with word frequency and with how informative each word is.
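A hedged sketch of the two steps, not the authors' reference implementation; `word_vecs` and `freqs` are assumed lookup tables from words to vectors and to frequencies:

```python
# Weighted bag-of-words, then remove the first principal direction.
import numpy as np

def sif_embeddings(sentences, word_vecs, freqs, a=1e-3):
    # Step 1: weighted average of word vectors, weight = a / (a + f(w)).
    X = np.stack([
        np.mean([a / (a + freqs[w]) * word_vecs[w] for w in sent], axis=0)
        for sent in sentences
    ])
    # Step 2: remove each sentence's projection onto the first principal direction.
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - np.outer(X @ u, u)
```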

 


Gradient descent (GD): keep stepping downhill along the negative gradient; the learning rate α must be small enough that we don't jump over the minimum.
Computing the gradient over a corpus of 40 billion words at every step cannot be done: not practical!!
Instead: stochastic gradient descent (SGD).
Pick just one center word together with the words around it, and from that single position compute a gradient for all the parameters (using that estimate of the gradient) and take a step.
It is not necessarily a good estimate of the true direction, but it is orders of magnitude faster than GD.
neural networks love noise!
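A minimal sketch of that loop; `grad_at_window` is a hypothetical function returning the gradient estimate computed from a single (center, context) window:

```python
# Stochastic gradient descent: update after every window instead of the full corpus.
def sgd(theta, windows, grad_at_window, alpha=0.025, epochs=1):
    for _ in range(epochs):
        for window in windows:                               # one noisy estimate at a time
            theta = theta - alpha * grad_at_window(theta, window)
    return theta
```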