Stanford CS224N - Word Vector Representations: Word2Vec
word vector: model of word meaning (vectors that can encode the word meanings)
Using taxonomic resources to handle word meaning. WordNet: taxonomy information about words
from nltk.corpus import wordnet (NLTK corpus module; see the sketch after this block)
synonyms and polysemy; but these taxonomic descriptions miss many nuances and are incomplete. People use words more flexibly
it is hard to give a precise definition of word similarity; WordNet can't capture many of the nuances
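A minimal sketch of querying WordNet through NLTK (assumes the wordnet corpus has already been downloaded with nltk.download('wordnet')):
```python
# Look up the WordNet synsets of "good"; each synset is one sense of the word.
from nltk.corpus import wordnet as wn

for synset in wn.synsets("good"):
    # pos() is the part of speech; lemma_names() are the synonymous lemmas.
    print(synset.pos(), synset.lemma_names())
```
The output is exactly the kind of fixed taxonomy described above: useful, but missing many nuances.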
Can build a very long vector and assign each word a 1 in its own position: the "one-hot" representation (a localist representation) → dot product between any two different words = 0
but it doesn't give any inherent notion of the relationships between words. We need to encode similarity.
Encode words in a way where you can just read the similarity between words off the dot product!!!
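A tiny numpy illustration (made-up 5-word vocabulary and made-up dense vectors): one-hot vectors of different words always have dot product 0, while dense vectors can carry similarity in their dot product.
```python
import numpy as np

# One-hot vectors for "motel" and "hotel" in a toy 5-word vocabulary.
motel_onehot = np.array([0, 1, 0, 0, 0])
hotel_onehot = np.array([0, 0, 1, 0, 0])
print(motel_onehot @ hotel_onehot)   # 0 -> no notion of similarity at all

# Made-up dense vectors: similar words end up with similar vectors.
motel_dense = np.array([0.25, -0.30, 0.60])
hotel_dense = np.array([0.20, -0.40, 0.70])
print(motel_dense @ hotel_dense)     # 0.59 -> similarity read off the dot product
```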
Use distributional similarity:
get a lot of value for representing a word by looking at the contexts in which it appears, and collecting statistics of the words that occur alongside it
“You shall know a word by the company it keeps”
The numbers in a word vector (a distributed representation, built from distributional similarity) should let it predict the other words in the text around the target word (words predict each other)
It feels like a probability distribution: a center word plus the context words around it
What is word2vec?
a general recipe for learning neural word embeddings -> predict between a center word and the words that appear in its context
p(context | w_t) = the probability of predicting the words that appear around the center word w_t
loss function: J = 1 - p(w_{-t} | w_t), where w_t is the center word and w_{-t} denotes the other words surrounding it
Change the representation of words to minimize our loss!!
Use deep learning: you set this goal and say nothing else about how to do it; leave the rest to DL!
the output is powerful word vectors!
- 1. having distributed representations of words and 2. being able to predict other words in context!
How does that happen?
word2vec: predict between every word and its context words to learn word vectors
two algorithms:
1. skip-grams (SG)
for each estimation step, take one word as the center word, then try to predict the words in its context within some window size (predict within a window of the sentence)
The model defines a probability distribution: the probability that a word appears in the context given a center word; choose the vector representations of words so as to maximize this probability
For a given word we have one and only one probability distribution (over its context)
for each position t = 1, 2, ..., T, predict the surrounding words in a window of radius m
- get big long sequences of words to learn from (enough words in each context)
- go through each position in the text, with a window of size 2m around it
- have a probability distribution over the context words
- set the parameters of our model (that is, the vector representations of the words) to maximize this probability (the objective is written out below)
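Written out, the skip-gram objective from the lecture is the average negative log probability of the context words (θ = all word vectors, m = window radius):
```latex
J(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\;\sum_{\substack{-m \le j \le m \\ j \ne 0}} \log p\left(w_{t+j} \mid w_t; \theta\right)
```
Minimizing J(θ) is the same as maximizing the probability of predicting every context word from its center word.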
Lecture 2: around the 27-minute mark
see iPad note
should each word have just one vector representation?
two can be better: one vector v for the word as a center word, and one vector u for the word as a context word
the model doesn't pay attention to the distance or position of words within the window
for each position in the context, multiply the center word's vector by a matrix that stores the representations of the context words
take the dot product u_o^T v_c as the similarity (multiply v_c by each row u_o); u_o^T v_c gives a single vector of scores, and the stacked results on the slide just represent the same v_c being transformed for each of the three context positions
softmax: turn the scores into probabilities and predict the word most likely to appear within the context window
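A small numpy sketch of this prediction step, with a toy vocabulary of size 5, embedding dimension 3, and random weights (U holds the context vectors u_w as rows, v_c is the center word's vector):
```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 3                       # toy vocabulary size and embedding dimension
U = rng.normal(size=(V, d))       # context-word vectors, one row u_w per word
v_c = rng.normal(size=d)          # center word's vector

scores = U @ v_c                                  # u_w^T v_c for every word w
probs = np.exp(scores) / np.exp(scores).sum()     # softmax -> p(o | c)
print(probs, probs.argmax())      # distribution over the vocabulary, most likely context word
```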
W is the weight matrix formed by all words as center words (the word-embedding matrix); W' is the weight matrix formed by all words as context words
the difference between the prediction and the actual context word gives the loss
W and W' (the v and u vectors):
Train the model: compute vector representations
stack the u and v vectors of every word together to construct one long parameter vector (each word contributes a d-dimensional v and a d-dimensional u; written out below)
this parameter vector is the thing we are going to be optimizing
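In symbols, with d-dimensional vectors and a vocabulary of V words, the parameter vector stacks every word's center vector and context vector:
```latex
\theta =
\begin{bmatrix}
v_{w_1} \\ \vdots \\ v_{w_V} \\ u_{w_1} \\ \vdots \\ u_{w_V}
\end{bmatrix}
\in \mathbb{R}^{2dV}
```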
Gradient!
PAPER: A Simple but Tough-to-Beat Baseline for Sentence Embeddings
How to encode the meaning of sentences? (not just words, as above)
Sentence Embedding:
- Compute sentence similarity through the inner (dot) product
- Sentence classification task, like sentiment analysis
Compose word representations into sentence representations:
- the bag-of-words (BoW): average of the word vectors
- CNN, RNN
This paper: a very simple unsupervised method
weighted BoW + remove a particular direction (the first principal component)
Step 1:
compute a weighted average of the word vectors, where each word gets its own weight
weight = a / (a + f(w)), where f(w) is the word's frequency and a is a small constant
Step 2:
compute the first principal component of the sentence embeddings and remove the projection onto it
this combines the sentence embedding with word frequency and the degree of relatedness between words (see the sketch after this block)
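A minimal sketch of the two steps with numpy; word_vecs (word → vector), freq (word → estimated frequency), and the tokenized sentences are hypothetical inputs, and a ≈ 1e-3 as suggested in the paper:
```python
import numpy as np

def sif_embeddings(sentences, word_vecs, freq, a=1e-3):
    # Step 1: weighted average of the word vectors, weight = a / (a + f(w)).
    emb = np.array([
        np.mean([(a / (a + freq[w])) * word_vecs[w] for w in sent], axis=0)
        for sent in sentences
    ])
    # Step 2: remove the projection of each sentence vector onto the first principal component.
    u = np.linalg.svd(emb, full_matrices=False)[2][0]   # first right singular vector
    return emb - np.outer(emb @ u, u)
```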
Gradient descent (GD): keep stepping downhill along the slope; the learning rate α must be small enough so we don't jump past the minimum
but taking the gradient over a corpus of 40 billion words cannot be done: not practical!!
instead: stochastic gradient descent (SGD)
pick just one center word and the words in its window; at that single position, compute the gradient for all the parameters (and use that estimate of the gradient)
not necessarily a good estimate of the true direction, but orders of magnitude faster than GD
neural networks love noise!
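A sketch of the SGD loop, with a hypothetical grad_at_window(theta, t) that returns the gradient estimated from the single window centered at position t:
```python
import numpy as np

def sgd(theta, positions, grad_at_window, alpha=0.025, epochs=1):
    # positions: indices of center words in the corpus; alpha: the (small) learning rate.
    for _ in range(epochs):
        np.random.shuffle(positions)          # visit windows in random order
        for t in positions:
            # noisy but cheap gradient estimate from one window only
            theta = theta - alpha * grad_at_window(theta, t)
    return theta
```
Each update uses a single window, so it is a noisy estimate of the full-corpus gradient, but many cheap noisy steps train far faster than one exact pass over 40 billion words.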