【505】Using keras for word-level one-hot encoding

阿新 • Published: 2020-12-26

　　The input to an Embedding layer is an integer matrix, not true one-hot vectors. The Tokenizer class is used to produce that input.
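To make the distinction concrete, here is a minimal pure-Python sketch (no Keras involved) that expands a list of integer indices into one-hot rows. The helper `one_hot`, the sample `sequence`, and `vocab_size` are made up for illustration only.

```python
def one_hot(index, size):
    """Return a list of length `size` with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

# Hypothetical integer-encoded sentence: one index per word.
sequence = [1, 2, 3, 4, 1, 5]
vocab_size = 6  # indices 0..5; index 0 is left unused here

# The integer form stores one number per word; the one-hot form
# stores a whole vector of length vocab_size per word.
one_hot_rows = [one_hot(i, vocab_size) for i in sequence]
for idx, row in zip(sequence, one_hot_rows):
    print(idx, row)
```

The integer form is what an Embedding layer consumes; the one-hot form grows with the vocabulary size and is mostly zeros.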

1.Tokenizer

1.1 Syntax

keras.preprocessing.text.Tokenizer(num_words=None, 
                                   filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
                                   lower=True, 
                                   split=' ', 
                                   char_level=False, 
                                   oov_token=None, 
                                   document_count=0)

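　　A text tokenization utility class. It supports two ways of vectorizing a text corpus: turning each text into a sequence of integers (each integer is the index of a token in a dictionary), or turning it into a vector in which the coefficient for each token can be binary, a word count, a TF-IDF weight, and so on.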

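1.2 Parameters

num_words: The maximum number of words to keep, based on word frequency. Only the most common num_words-1 words are kept, because index 0 is reserved.
filters: A string in which each character will be filtered out of the text. The default is all punctuation, plus tab and newline, minus the ' character.
lower: Boolean. Whether to convert the text to lowercase.
split: String. The separator used to split the text.
char_level: If True, every character is treated as a token.

oov_token: If given, it is added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls.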
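The interaction between num_words and oov_token is easier to see in code. The following is a pure-Python sketch, not the Keras implementation: `build_index` and `to_sequence` are hypothetical helpers, and two behaviors are assumed here to mirror Keras: the OOV token takes index 1, and a word is dropped (or replaced) when its index is greater than or equal to num_words.

```python
from collections import Counter

def build_index(texts, oov_token=None):
    """Rank words by frequency, most frequent first, starting at index 1."""
    counts = Counter(w for t in texts for w in t.lower().split())
    vocab = [w for w, _ in counts.most_common()]
    if oov_token is not None:
        vocab = [oov_token] + vocab  # assumption: OOV token gets index 1
    return {w: i for i, w in enumerate(vocab, start=1)}

def to_sequence(text, word_index, num_words=None, oov_token=None):
    """Map words to indices, honoring the num_words cutoff."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    seq = []
    for w in text.lower().split():
        i = word_index.get(w)
        # Assumption: indices >= num_words count as out of vocabulary.
        if i is None or (num_words is not None and i >= num_words):
            if oov_index is not None:
                seq.append(oov_index)
            continue
        seq.append(i)
    return seq

texts = ['the cat sat on the mat', 'the dog ate my homework']

idx_plain = build_index(texts)
seq_plain = to_sequence('the cat ate fish', idx_plain, num_words=4)
print(seq_plain)  # [1, 2]: 'ate' and 'fish' are silently dropped

idx_oov = build_index(texts, oov_token='<unk>')
seq_oov = to_sequence('the cat ate fish', idx_oov, num_words=4, oov_token='<unk>')
print(seq_oov)    # [2, 3, 1, 1]: dropped words become the OOV index
```

Without an OOV token the output sequence gets shorter; with one, its length matches the number of input words.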

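　　By default, all punctuation is removed, turning the text into a space-separated sequence of words (words may contain the ' character). These sequences are then split into lists of tokens, which are then indexed or vectorized.

　　0 is a reserved index that is never assigned to any word.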
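The default preprocessing described above can be approximated in a few lines. This is a sketch of the idea, not the library's code: `to_words` is a hypothetical helper, and `FILTERS` repeats the default filter string from the signature in section 1.1.

```python
# Default filter characters: punctuation plus tab and newline, without '.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def to_words(text, filters=FILTERS, lower=True, split=' '):
    """Lowercase, replace filtered characters with the separator, then split."""
    if lower:
        text = text.lower()
    # Map every filtered character to the split string.
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Consecutive separators produce empty strings; drop them.
    return [w for w in text.split(split) if w]

words_a = to_words('The cat sat on the mat.')
words_b = to_words("Don't stop, it's fine!")
print(words_a)  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(words_b)  # ["don't", 'stop', "it's", 'fine']
```

The second call shows why the apostrophe is left out of the filter string: contractions survive as single tokens.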

1.3 Methods

fit_on_texts(texts)
- texts: list of texts to train on
texts_to_sequences(texts)
- texts: list of texts to convert to sequences
- Returns: a list of sequences, one per input text
- Example:
```
[[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]
```
texts_to_sequences_generator(texts)
- The generator version of texts_to_sequences
- texts: list of texts to convert to sequences
- Returns: yields one sequence per input text
texts_to_matrix(texts, mode)
- texts: list of texts to vectorize
- mode: one of 'binary', 'count', 'tfidf', 'freq'; defaults to 'binary'
- Returns: a numpy array of shape (len(texts), num_words)
- Example:
```
[[0. 1. 1. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]
```
fit_on_sequences(sequences)
- sequences: list of sequences to train on
sequences_to_matrix(sequences, mode)
- sequences: list of sequences to vectorize
- mode: one of 'binary', 'count', 'tfidf', 'freq'; defaults to 'binary'
- Returns: a numpy array of shape (len(sequences), num_words)
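The matrix modes differ only in what is written into each cell. Below is a pure-Python sketch of 'binary', 'count', and 'freq' using lists instead of a numpy array; `to_matrix` is a hypothetical helper, 'freq' is assumed to mean the count divided by the sequence length, and 'tfidf' is left out on purpose.

```python
from collections import Counter

def to_matrix(sequences, num_words, mode='binary'):
    """Build a len(sequences) x num_words matrix as a list of lists."""
    matrix = [[0.0] * num_words for _ in sequences]
    for row, seq in zip(matrix, sequences):
        # Count each index, ignoring anything outside the vocabulary.
        counts = Counter(i for i in seq if i < num_words)
        for i, c in counts.items():
            if mode == 'binary':
                row[i] = 1.0
            elif mode == 'count':
                row[i] = float(c)
            elif mode == 'freq':
                row[i] = c / len(seq)  # assumption: share of the sequence
            else:
                raise ValueError('unsupported mode: %s' % mode)
    return matrix

sequences = [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]

binary = to_matrix(sequences, num_words=10, mode='binary')
count = to_matrix(sequences, num_words=10, mode='count')
freq = to_matrix(sequences, num_words=10, mode='freq')

print(binary[0])  # [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(count[0][1])  # 2.0, because index 1 appears twice in the first sequence
```

Column 0 stays empty in every row because index 0 is never assigned to a word.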

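1.4 Attributes

word_counts: A dictionary mapping words (strings) to the number of times they appeared during training. Only set after fit_on_texts has been called.
- Example:
```
OrderedDict([('the', 3), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1), ('dog', 1), ('ate', 1), ('my', 1), ('homework', 1)])
```
- len(word_counts) gives the number of distinct words
word_docs: A dictionary mapping words (strings) to the number of documents or texts they appeared in during training. Only set after fit_on_texts has been called.
word_index: A dictionary mapping words (strings) to their rank or index. Only set after fit_on_texts has been called.
document_count: Integer. The number of documents (texts or sequences) the tokenizer was trained on. Only set after fit_on_texts or fit_on_sequences has been called.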
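How the four attributes relate to each other can be sketched without Keras. The code below recomputes them by hand for two sample sentences; it is an illustration under simplifying assumptions (only the period is stripped, ties in frequency keep first-seen order), not the library's implementation.

```python
from collections import OrderedDict, defaultdict

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

word_counts = OrderedDict()   # word -> total occurrences
word_docs = defaultdict(int)  # word -> number of texts containing it
document_count = 0

for text in samples:
    document_count += 1
    # Simplified cleanup: lowercase and strip the period only.
    words = text.lower().replace('.', ' ').split()
    for w in words:
        word_counts[w] = word_counts.get(w, 0) + 1
    for w in set(words):  # count each word once per text
        word_docs[w] += 1

# Rank by frequency; sorted() is stable, so ties keep first-seen order.
ranked = sorted(word_counts.items(), key=lambda kv: kv[1], reverse=True)
word_index = {w: i for i, (w, _) in enumerate(ranked, start=1)}

print(word_counts['the'], word_docs['the'], word_index['the'], document_count)
# 3 2 1 2
```

Here 'the' occurs three times overall but in only two texts, which is the difference between word_counts and word_docs.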

1.5 Example

from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# We create a tokenizer, configured to only take
# into account the top-1000 most common words
tokenizer = Tokenizer(num_words=1000)
# This builds the word index
tokenizer.fit_on_texts(samples)

print("word_counts: \n", tokenizer.word_counts)
print("\ntotal words: \n", len(tokenizer.word_counts))

# This turns strings into lists of integer indices.
sequences = tokenizer.texts_to_sequences(samples)

print("\nsequences:\n", sequences)

# You could also directly get the one-hot binary representations.
# Note that other vectorization modes than one-hot encoding are supported!
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

print("\none_hot_results:\n", one_hot_results)

# This is how you can recover the word index that was computed
word_index = tokenizer.word_index

print("\nword_index:\n", word_index)
print('\nFound %s unique tokens.' % len(word_index))

　　Output:

word_counts: 
 OrderedDict([('the', 3), ('cat', 1), ('sat', 1), ('on', 1), ('mat', 1), ('dog', 1), ('ate', 1), ('my', 1), ('homework', 1)])

total words: 
 9

sequences:
 [[1, 2, 3, 4, 1, 5], [1, 6, 7, 8, 9]]

one_hot_results:
 [[0. 1. 1. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]]

word_index:
 {'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5, 'dog': 6, 'ate': 7, 'my': 8, 'homework': 9}

Found 9 unique tokens.
