[Chinese/English] [Andrew Ng Programming Assignments] Course 5 - Sequence Models - Week 2 Assignment - Operations on Word Vectors and Emoji Generator

Data download

The data used in this post has been uploaded to Baidu Netdisk [click to download (65.99 MB)]. Please download it before you start.

[Python version used by the author: 3.6.2]

1 - Word vector operations

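Because training word embeddings is very resource-intensive, most people load pre-trained embeddings instead. In this post we will learn how to:

Load pre-trained word vectors
Measure similarity using cosine similarity
Use word embeddings to solve word analogy problems such as "man is to woman as king is to ____"
Modify word embeddings to reduce gender bias

Let's start by loading the packages we need: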

import numpy as np
import w2v_utils

Next we load the word vectors. Here we use 50-dimensional vectors to represent each word:

words, word_to_vec_map = w2v_utils.read_glove_vecs('data/glove.6B.50d.txt')

We have loaded the following:

words: the set of words
word_to_vec_map: a dictionary mapping each word to its GloVe vector
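
The helper `w2v_utils.read_glove_vecs` is not shown in this post. As a rough sketch of what such a loader might do, assuming the plain-text GloVe layout in which each line holds a word followed by its space-separated vector components, something like the following could work. The name `load_glove_text` and the tiny demo file are hypothetical and are not the course's actual helper.

```python
import os
import tempfile

import numpy as np


def load_glove_text(path):
    """Hypothetical loader: parse a GloVe-style text file into (words, word_to_vec_map)."""
    words = set()
    word_to_vec_map = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            if not parts:
                continue  # skip blank lines
            word = parts[0]
            words.add(word)
            # The remaining fields are the vector components, stored as text.
            word_to_vec_map[word] = np.array(parts[1:], dtype=np.float64)
    return words, word_to_vec_map


# Demo on a made-up 3-dimensional file (the real file used here is 50-dimensional).
demo_text = "hello 0.1 0.2 0.3\nworld -0.4 0.5 0.6\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(demo_text)
    demo_path = tmp.name

demo_words, demo_map = load_glove_text(demo_path)
os.remove(demo_path)
print(sorted(demo_words))
print(demo_map["hello"])
```

Returning both a set and a dictionary mirrors the two values described above: the set is convenient for membership checks, the dictionary for vector lookup.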

You can use the statements below to see what is actually stored:

# python 3.x
print(word_to_vec_map['hello'])

# python 2.x
print word_to_vec_map['hello']

Output:

[-0.38497   0.80092   0.064106 -0.28355  -0.026759 -0.34532  -0.64253
 -0.11729  -0.33257   0.55243  -0.087813  0.9035    0.47102   0.56657
  0.6985   -0.35229  -0.86542   0.90573   0.03576  -0.071705 -0.12327
  0.54923   0.47005   0.35572   1.2611   -0.67581  -0.94983   0.68666
  0.3871   -1.3492    0.63512   0.46416  -0.48814   0.83827  -0.9246
 -0.33722   0.53741  -1.0616   -0.081403 -0.67111   0.30923  -0.3923
 -0.55002  -0.68827   0.58049  -0.11626   0.013139 -0.57654   0.048833
  0.67204 ]

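Because one-hot vectors cannot capture the similarity between words well, we use GloVe vectors, which store more, and more useful, information about each word. Let's now see how to compare the similarity of two words.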
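
To make the limitation of one-hot vectors concrete, here is a small sketch with a made-up three-word vocabulary. Every pair of distinct one-hot vectors has a dot product of 0, so no word looks any closer to one word than to another:

```python
import numpy as np

vocab = ["king", "queen", "apple"]  # toy vocabulary, for illustration only
one_hot = np.eye(len(vocab))        # row i is the one-hot vector of vocab[i]

# Dot products between every pair of one-hot vectors.
gram = one_hot @ one_hot.T
print(gram)
```

The result is the identity matrix: each word matches only itself, and "king" is exactly as unrelated to "queen" as it is to "apple".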

1.1 - Cosine similarity

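To measure how similar two words are, we need a way to measure the similarity between their embedding vectors. Given two vectors $u$ and $v$, cosine similarity is defined as: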

$CosineSimilarity(u, v) = \frac{u \cdot v}{||u||_2 ||v||_2} = cos(\theta)\tag{1-1}$

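Here $u \cdot v$ is the dot product (inner product) of the two vectors, $||u||_2$ is the norm (length) of $u$, and $\theta$ is the angle between $u$ and $v$. The similarity depends on this angle: the more similar the two vectors are, the closer $cos(\theta)$ is to 1; the more dissimilar they are, the larger the angle and the closer $cos(\theta)$ gets to -1.

Figure 1-1: The cosine of the angle between two vectors measures how similar they are

Next we implement a function, cosine_similarity(), that computes the similarity of two words.

Reminder: the norm of $u$ is defined as $||u||_2 = \sqrt{\sum_{i=1}^{n} u_i^2}$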

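def cosine_similarity(u, v):
    """
    The cosine similarity of u and v reflects how similar u and v are.
    
    Arguments:
        u -- word vector of shape (n,)
        v -- word vector of shape (n,)
        
    Returns:
        cosine_similarity -- the cosine similarity between u and v, as defined by formula (1-1) above.
    """
    # Compute the dot product of u and v
    dot = np.dot(u, v)
    
    # Compute the L2 norm of u
    norm_u = np.sqrt(np.sum(np.power(u, 2)))
    
    # Compute the L2 norm of v
    norm_v = np.sqrt(np.sum(np.power(v, 2)))
    
    # Compute the cosine similarity according to formula (1-1)
    cosine_similarity = np.divide(dot, norm_u * norm_v)
    
    return cosine_similarity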
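
As an optional variation on the function above, the same quantity can be written with `np.linalg.norm`. The sketch below also guards against zero-length vectors, for which the angle is undefined. The name `cosine_similarity_norm`, the `eps` threshold, and the choice to return 0.0 in that case are assumptions made for this illustration, not part of the assignment.

```python
import numpy as np


def cosine_similarity_norm(u, v, eps=1e-12):
    """Cosine similarity via np.linalg.norm; returns 0.0 if either vector is (near) zero."""
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    if norm_u < eps or norm_v < eps:
        return 0.0  # convention chosen for this sketch; the angle is undefined here
    return float(np.dot(u, v) / (norm_u * norm_v))


# Three reference cases: same direction, orthogonal, opposite direction.
same = cosine_similarity_norm(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orth = cosine_similarity_norm(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
opp = cosine_similarity_norm(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))
zero = cosine_similarity_norm(np.zeros(2), np.array([1.0, 1.0]))
print(same, orth, opp, zero)
```

The three reference cases correspond to angles of 0, 90 and 180 degrees, which is a quick way to sanity-check any implementation of formula (1-1).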

Let's test it:

father = word_to_vec_map["father"]
mother = word_to_vec_map["mother"]
ball = word_to_vec_map["ball"]
crocodile = word_to_vec_map["crocodile"]
france = word_to_vec_map["france"]
italy = word_to_vec_map["italy"]
paris = word_to_vec_map["paris"]
rome = word_to_vec_map["rome"]

print("cosine_similarity(father, mother) = ", cosine_similarity(father, mother))
print("cosine_similarity(ball, crocodile) = ",cosine_similarity(ball, crocodile))
print("cosine_similarity(france - paris, rome - italy) = ",cosine_similarity(france - paris, rome - italy))

Test results:

cosine_similarity(father, mother) =  0.890903844289
cosine_similarity(ball, crocodile) =  0.274392462614
cosine_similarity(france - paris, rome - italy) =  -0.675147930817

Feel free to swap in other words and see how similar they are.
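
One thing worth noticing in the last test is that `france - paris` is a country-minus-capital offset while `rome - italy` is capital-minus-country, so the two offsets point in roughly opposite directions, which is consistent with the negative score. The small sketch below uses made-up 3-dimensional vectors (not real GloVe values) to show that reversing one of the two offsets simply flips the sign of the cosine similarity:

```python
import numpy as np


def cos_sim(u, v):
    """Cosine similarity of two non-zero vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


# Made-up vectors standing in for two "country - capital" offsets.
offset_1 = np.array([0.9, -0.2, 0.4])
offset_2 = np.array([0.8, -0.1, 0.5])

aligned = cos_sim(offset_1, offset_2)    # both offsets oriented the same way
flipped = cos_sim(offset_1, -offset_2)   # second offset reversed
print(aligned, flipped)
```

The two printed values have the same magnitude and opposite signs.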

1.2 - Word analogy

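Here we learn to solve problems of the form "A is to B as C is to ____", for example "man is to woman as king is to queen". In practice we need to find a word $d$ such that $e_a, e_b, e_c, e_d$ satisfy $e_b - e_a \approx e_d - e_c$, where the closeness of $e_b - e_a$ and $e_d - e_c$ is judged using cosine similarity.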

def complete_analogy(word_a, word_b, word_c, word_to_vec_map):
    """
    Solve problems of the form "A is to B as C is to ____".
    
    Arguments:
        word_a -- a word, as a string
        word_b -- a word, as a string
        word_c -- a word, as a string
        word_to_vec_map -- dictionary mapping words to their GloVe vectors
        
    Returns:
        best_word -- the word for which (v_best_word - v_c) is closest to (v_b - v_a)
    """
    
    # Convert the words to lowercase
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    
    # Look up the word vectors of the three words
    e_a, e_b, e_c = word_to_vec_map[word_a], word_to_vec_map[word_b], word_to_vec_map[word_c]
    
    # Get all the words
    words = word_to_vec_map.keys()
    
    # Initialize max_cosine_sim to a large negative number
    max_cosine_sim = -100
    best_word = None
    
    # Loop over the whole vocabulary
    for word in words:
        # Avoid matching one of the input words
        if word in [word_a, word_b, word_c]:
            continue
        # Compute the cosine similarity
        cosine_sim = cosine_similarity((e_b - e_a), (word_to_vec_map[word] - e_c))
        
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = word
            
    return best_word
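
The loop above calls the similarity function once per vocabulary word. As an optional sketch, the same search can be expressed as a single matrix operation, which may be faster on a large vocabulary. The function name `complete_analogy_vectorized` and the 2-dimensional toy embeddings are made up for illustration; the sketch also assumes that no candidate vector is exactly equal to the vector of `word_c`, since that would give a zero norm.

```python
import numpy as np


def complete_analogy_vectorized(word_a, word_b, word_c, word_to_vec_map):
    """Same search as complete_analogy, scoring all candidates in one matrix operation."""
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    e_a, e_b, e_c = (word_to_vec_map[w] for w in (word_a, word_b, word_c))

    # Candidate words exclude the three inputs.
    candidates = [w for w in word_to_vec_map if w not in (word_a, word_b, word_c)]
    diffs = np.stack([word_to_vec_map[w] for w in candidates]) - e_c  # shape (m, n)
    target = e_b - e_a                                                 # shape (n,)

    # Cosine similarity of every row of diffs with target.
    sims = diffs @ target / (np.linalg.norm(diffs, axis=1) * np.linalg.norm(target))
    return candidates[int(np.argmax(sims))]


# Made-up 2-d embeddings: axis 0 is roughly "royalty", axis 1 roughly "gender".
toy_map = {
    "man": np.array([0.0, -1.0]),
    "woman": np.array([0.0, 1.0]),
    "king": np.array([1.0, -1.0]),
    "queen": np.array([1.0, 1.0]),
    "apple": np.array([-1.0, 0.1]),
}
result = complete_analogy_vectorized("man", "woman", "king", toy_map)
print(result)  # prints "queen"
```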

Let's test it:

triads_to_try = [('italy', 'italian', 'spain'), ('india', 'delhi', 'japan'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]
for triad in triads_to_try:
    print ('{} -> {} <====> {} -> {}'.format( *triad, complete_analogy(*triad,word_to_vec_map)))

Test results:

italy -> italian <====> spain -> spanish
india -> delhi <====> japan -> tokyo
man -> woman <====> boy -> girl
small -> smaller <====> large -> larger

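Feel free to change the words above and see whether you get the output you expect. You can also try to make the program slip up, for example with small -> smaller <===> big -> ?. Try it yourself: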

triads_to_try = [('small', 'smaller', 'big')]
for triad in triads_to_try:
    print ('{} -> {} <====> {} -> {}'.format( *triad, complete_analogy(*triad,word_to_vec_map)))

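That completes word analogies. The point to remember is that cosine similarity is a good way to compare word vectors, although L2 distance (Euclidean distance) can also be used.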
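
For comparison, here is a minimal sketch of the same analogy search using L2 distance instead of cosine similarity: it picks the word $d$ that minimizes $||(e_b - e_a) - (e_d - e_c)||_2$. The function name `complete_analogy_l2` and the toy embeddings are hypothetical, and the two criteria are not guaranteed to agree on real GloVe vectors.

```python
import numpy as np


def complete_analogy_l2(word_a, word_b, word_c, word_to_vec_map):
    """Return the word d that minimizes ||(e_b - e_a) - (e_d - e_c)||_2."""
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()
    e_a, e_b, e_c = (word_to_vec_map[w] for w in (word_a, word_b, word_c))
    target = e_b - e_a

    best_word, best_dist = None, float("inf")
    for word, vec in word_to_vec_map.items():
        if word in (word_a, word_b, word_c):
            continue  # skip the input words
        dist = np.linalg.norm(target - (vec - e_c))
        if dist < best_dist:
            best_dist, best_word = dist, word
    return best_word


# Made-up 2-d embeddings, for illustration only.
toy_map = {
    "man": np.array([0.0, -1.0]),
    "woman": np.array([0.0, 1.0]),
    "king": np.array([1.0, -1.0]),
    "queen": np.array([1.0, 1.0]),
    "apple": np.array([-1.0, 0.1]),
}
l2_result = complete_analogy_l2("man", "woman", "king", toy_map)
print(l2_result)  # prints "queen"
```

Unlike cosine similarity, L2 distance is sensitive to vector length as well as direction, which is one reason the two can rank candidates differently.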

1.3 - Debiasing word vectors (optional)

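In this part we examine the gender bias reflected in word embeddings and try to remove it. Besides covering that topic, this section also helps sharpen your intuition about word vectors. It involves some linear algebra, though nothing too hard. If you have not studied linear algebra you can skip this section, or you can keep going and dig in.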

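Let's first look at the gender bias contained in the embeddings. We compute $g = e_{woman} - e_{man}$, where $e_{woman}$ is the word vector for "woman" and $e_{man}$ is the word vector for "man". The resulting vector $g$ roughly encodes the concept of gender. Computing the average of $g_1 = e_{mother} - e_{father}$ and $g_2 = e_{girl} - e_{boy}$ might give a somewhat more accurate estimate, but here $e_{woman} - e_{man}$ is good enough.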

g = word_to_vec_map['woman'] - word_to_vec_map['man']
print(g)

Output:

[-0.087144    0.2182     -0.40986    -0.03922    -0.1032      0.94165
 -0.06042     0.32988     0.46144    -0.35962     0.31102    -0.86824
  0.96006     0.01073     0.24337     0.08193    -1.02722    -0.21122
  0.695044   -0.00222     0.29106     0.5053     -0.099454    0.40445
  0.30181     0.1355     -0.0606     -0.07131    -0.19245    -0.06115
 -0.3204      0.07165    -0.13337    -0.25068714 -0.14293    -0.224957
 -0.149       0.048882    0.12191    -0.27362    -0.165476   -0.20426
  0.54376    -0.271425   -0.10245    -0.32108     0.2516     -0.33455
 -0.04371     0.01258   ]
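
The averaging idea mentioned above, using $g_1 = e_{mother} - e_{father}$ and $g_2 = e_{girl} - e_{boy}$, could be sketched as follows. The 3-dimensional vectors here are made up so that the snippet runs on its own; with the real data you would look the words up in `word_to_vec_map` instead.

```python
import numpy as np

# Made-up 3-d embeddings, for illustration only (not real GloVe values).
toy_map = {
    "father": np.array([0.6, -0.8, 0.3]),
    "mother": np.array([0.5, 0.8, 0.3]),
    "boy": np.array([-0.3, -0.9, -0.1]),
    "girl": np.array([-0.3, 0.9, -0.2]),
}

# Each pair contributes one estimate of the gender direction.
pairs = [("mother", "father"), ("girl", "boy")]
differences = [toy_map[female] - toy_map[male] for female, male in pairs]

# Average the estimates component-wise.
g_avg = np.mean(differences, axis=0)
print(g_avg)
```

Averaging tends to cancel components that are specific to a single word pair, leaving the part the pairs have in common.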

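Now let's look at the cosine similarity between various words and $g$, and consider what a positive value versus a negative value of that similarity means.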

name_list = ['john', 'marie', 'sophie', 'ronaldo', 'priya', 'rahul', 'danielle', 'reza', 'katy', 'yasmin']

for w in name_list:
    print (w, cosine_similarity(word_to_vec_map[w], g))

Output:

john -0.23163356146
marie 0.315597935396
sophie 0.318687898594
ronaldo -0.312447968503
priya 0.17632041839
rahul -0.169154710392
danielle 0.243932992163
reza -0.079304296722
katy 0.283106865957
yasmin 0.233138577679

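As we can see, female names have a positive cosine similarity with $g$ while male names have a negative one, which is not surprising. Let's look at some other words: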

word_list = ['lipstick', 'guns', 'science', 'arts', 'literature', 'warrior','doctor', 'tree', 'receptionist', 
             'technology',  'fashion', 'teacher', 'engineer', 'pilot', 'computer', 'singer']
for w in word_list:
    print (w, cosine_similarity(word_to_vec_map[w], g))

Output:

lipstick 0.276919162564
guns -0.18884855679
science -0.0608290654093
arts 0.00818931238588
literature 0.0647250443346
warrior -0.209201646411
doctor 0.118952894109
tree -0.0708939917548
receptionist 0.330779417506
technology -0.131937324476
fashion 0.0356389462577
teacher 0.179209234318
engineer -0.0803928049452
pilot 0.00107644989919
computer -0.103303588739
singer 0.185005181365

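Notice anything? For example, "computer" leans toward "man" and "literature" leans toward "woman". These associations reflect stereotypes that should not be there, so how can we reduce this bias?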

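For certain words, such as "actor" and "actress" or "grandmother" and "grandfather", a gender difference is appropriate. Other words, such as "receptionist" and "technology", should not carry any gender difference. These two kinds of words need to be treated differently.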

1.3.1 - Neutralizing bias for non-gender-specific words

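The accompanying figure illustrates the effect of removing the bias. If we use 50-dimensional word embeddings, the 50-dimensional space can be split into two parts: the bias direction $g$ and the remaining 49 dimensions $g_{\perp}$. In linear algebra terms, the 49-dimensional $g_{\perp}$ is perpendicular (orthogonal) to $g$, that is, it forms a 90-degree angle with $g$. Taking the word "receptionist" as an example, during debiasing we zero out the component of $e_{receptionist}$ along the $g$ direction, which gives $e^{debiased}_{receptionist}$.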
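
Zeroing out the component along $g$ can be sketched as a standard vector projection: subtract from $e$ its projection onto $g$, that is $e - \frac{e \cdot g}{||g||_2^2} g$. The function name `neutralize_vector` and the toy vectors below are assumptions made for this illustration and may differ from the assignment's own implementation.

```python
import numpy as np


def neutralize_vector(e, g):
    """Remove from e its component along the bias direction g (orthogonal projection)."""
    # Projection of e onto g: (e . g / ||g||^2) * g
    e_bias_component = (np.dot(e, g) / np.sum(g * g)) * g
    return e - e_bias_component


# Made-up 3-d vectors, for illustration only.
g_toy = np.array([0.0, 2.0, 0.0])    # toy bias direction
e_toy = np.array([0.5, 0.7, -0.3])   # toy word vector

e_debiased = neutralize_vector(e_toy, g_toy)
print(e_debiased)
print(np.dot(e_debiased, g_toy))
```

After this step the debiased vector is orthogonal to $g$, so its cosine similarity with the bias direction is zero up to floating-point error.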
