Character-level or word-level one-hot encoding vs. word embeddings (Keras implementation)
阿新 • Published: 2018-11-16
1. One-hot encoding
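One-hot encoding at the character level:

    import string
    import numpy as np

    samples = ['zzh is a pig', 'he loves himself very much', 'pig pig han']
    characters = string.printable  # all printable ASCII characters
    # Map each character to an integer index starting at 1 (index 0 is left unused)
    token_index = dict(zip(characters, range(1, len(characters) + 1)))
    max_length = 20  # number of character positions kept per sample
    results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))

    # string.printable evaluates to:
    # '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'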
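The listing above stops after allocating the result tensor. A minimal sketch of the complete character-level encoding follows, written with plain Python lists so it runs without NumPy. The function name `one_hot_chars` is hypothetical, and truncating each sample to `max_length` characters is an assumption (the second sample is longer than 20 characters).

```python
import string

def one_hot_chars(samples, max_length=20):
    """Return (token_index, results); results[i][j][k] is 1 when the j-th
    character of samples[i] has index k, and 0 otherwise."""
    characters = string.printable
    # character -> index, starting at 1; index 0 is left unused
    token_index = {ch: i for i, ch in enumerate(characters, start=1)}
    depth = len(token_index) + 1
    results = [[[0] * depth for _ in range(max_length)] for _ in samples]
    for i, sample in enumerate(samples):
        # Assumption: characters beyond max_length are dropped
        for j, ch in enumerate(sample[:max_length]):
            index = token_index.get(ch)
            if index is not None:  # skip characters outside string.printable
                results[i][j][index] = 1
    return token_index, results

samples = ['zzh is a pig', 'he loves himself very much', 'pig pig han']
token_index, results = one_hot_chars(samples)
# Dimensions: samples x positions x (number of printable characters + 1)
print(len(results), len(results[0]), len(results[0][0]))  # 3 20 101
```

Positions past the end of a short sample stay all zeros, which acts as padding.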
Word-level one-hot encoding with Keras:

    from keras.preprocessing.text import Tokenizer

    samples = ['zzh is a pig', 'he loves himself very much', 'pig pig han']
    # Create a tokenizer configured to consider only the 100 most common words
    tokenizer = Tokenizer(num_words=100)
    tokenizer.fit_on_texts(samples)  # build the word index

Output (truncated in the source):

    sequences = [[2, 3, 4, 1],
    Found 10 unique tokens.
    {'pig': 1, 'zzh': 2, 'is': 3, 'a': 4, 'he': 5,
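The listing and its output above are cut off. To show how such a word index and the integer sequences come about, here is a minimal pure-Python sketch that mimics the ranking visible in the output: the most frequent word gets index 1, and ties keep first-seen order. The helpers `build_word_index`, `to_sequences`, and `to_binary_matrix` are hypothetical names for illustration, not Keras APIs. The real `Tokenizer` also strips punctuation, which this sketch skips.

```python
from collections import Counter

def build_word_index(samples):
    """Rank words by frequency (1 = most frequent); ties keep first-seen order."""
    counts = Counter()
    for sample in samples:
        counts.update(sample.lower().split())
    # most_common() sorts by count, descending, and the sort is stable
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(samples, word_index):
    """Replace each known word with its integer index."""
    return [[word_index[w] for w in s.lower().split() if w in word_index]
            for s in samples]

def to_binary_matrix(sequences, num_words):
    """One row per sample; column k is 1 if word index k occurs in the sample."""
    matrix = [[0] * num_words for _ in sequences]
    for i, seq in enumerate(sequences):
        for idx in seq:
            if idx < num_words:  # ignore indices beyond the vocabulary cap
                matrix[i][idx] = 1
    return matrix

samples = ['zzh is a pig', 'he loves himself very much', 'pig pig han']
word_index = build_word_index(samples)
sequences = to_sequences(samples, word_index)
matrix = to_binary_matrix(sequences, num_words=100)

print(sequences)        # [[2, 3, 4, 1], [5, 6, 7, 8, 9], [1, 1, 10]]
print(len(word_index))  # 10
```

Each row of `matrix` is a bag-of-words style one-hot vector for a whole sample, so word order and repeat counts are lost in that representation.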
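One variant of one-hot encoding is the one-hot hashing trick, which is useful when the number of unique tokens in the vocabulary is too large to handle directly. Instead of explicitly assigning an index to each word and storing those indices in a dictionary, this method hashes each word into a vector of fixed size, usually with a very simple hash function.
Advantage: it saves memory and allows online encoding of the data (token vectors can be generated immediately, before all of the data has been read).
Drawback: hash collisions can occur, meaning two different words may map to the same index. The likelihood of a collision decreases when the dimensionality of the hash space is much larger than the number of unique tokens being hashed.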
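    import numpy as np

    samples = ['the cat sat on the mat the cat sat on the mat the cat sat on the mat',
               'the dog ate my homework']
    dimensionality = 1000  # each word is stored as a 1000-dimensional vector
    max_length = 10        # only the first 10 words of each sample are encoded
    results = np.zeros((len(samples), max_length, dimensionality))
    for i, sample in enumerate(samples):
        for j, word in list(enumerate(sample.split()))[:max_length]:
            # Hash the word into an index between 0 and dimensionality - 1.
            # Note: Python salts str hashes per process unless PYTHONHASHSEED is set,
            # so these indices change from one run to the next.
            index = abs(hash(word)) % dimensionality
            results[i, j, index] = 1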
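Python's built-in `hash()` is salted per process for strings, so the indices produced by the hashing code above are not reproducible across runs. Below is a minimal sketch of a reproducible variant that also counts collisions for a given hash-space size. Using `hashlib.md5` as the hash function is an assumption made here for illustration, not something the source prescribes, and `stable_index` and `count_collisions` are hypothetical helper names.

```python
import hashlib

def stable_index(word, dimensionality):
    """Map a word to a bucket in [0, dimensionality), identically on every run."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % dimensionality

def count_collisions(words, dimensionality):
    """Number of unique words that share a bucket with an earlier word."""
    unique = set(words)
    buckets = {stable_index(w, dimensionality) for w in unique}
    return len(unique) - len(buckets)

words = "the cat sat on the mat the dog ate my homework".split()
for dim in (4, 1000):
    # With 9 unique words and only 4 buckets, at least 5 collisions are unavoidable
    print(dim, count_collisions(words, dim))
```

Running this with a small and a large `dimensionality` gives a concrete feel for how a larger hash space reduces collisions.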
2. Word embeddings