Word processing in NLP with TensorFlow
Tokenizer
source code: https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/text.py#L490-L519
some important functions and variables:

- def fit_on_texts(self, texts)  # texts can be a string, a list of strings, or a list of lists of strings
- self.word_index  # a dictionary that maps each word to a unique index
- self.index_word  # the reverse of word_index: maps each index back to its word
sample

```python
import tensorflow as tf
from tensorflow import keras
# the class that tokenizes text
from tensorflow.keras.preprocessing.text import Tokenizer

# transform the words into numbers
sentences = ['i love my dog', 'i love my cat', 'you love my dog!']
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
print(word_index)
# result: {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
```
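To illustrate `index_word`, here is a small sketch (assuming the same sentences as above): it is simply the reverse mapping of `word_index`, so it can decode a sequence of indices back into words.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['i love my dog', 'i love my cat', 'you love my dog!']
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)

# index_word maps index -> word, the reverse of word_index
print(tokenizer.index_word[1])  # 'love'

# it can be used to turn a sequence of indices back into text
decoded = ' '.join(tokenizer.index_word[i] for i in [3, 1, 2, 4])
print(decoded)  # 'i love my dog'
```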
Sequencing
- texts_to_sequences(self, texts)  # transforms each text in texts to a sequence of integers
- tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.)  # pads the sentences to the same length
sample

```python
sentences = ['i love my dog', 'i love my cat', 'you love my dog!',
             'do you think my dog is amazing']
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
# result: [[3, 1, 2, 4], [3, 1, 2, 5], [6, 1, 2, 4], [6, 2, 4]]
# 'do', 'think', 'is', and 'amazing' get no encoding because they
# did not appear in the texts the tokenizer was fitted on
```
To solve this problem, we can set an OOV (out-of-vocabulary) token in the tokenizer to encode words that were not seen during fitting.
```python
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
# rerunning the code above now gives:
# [[4, 2, 3, 5], [4, 2, 3, 6], [7, 2, 3, 5], [1, 7, 1, 3, 5, 1, 1]]
```
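Note that the OOV token is inserted at index 1, ahead of every real word. A small sketch (assuming the same three fitting sentences as above) makes this visible:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['i love my dog', 'i love my cat', 'you love my dog!']
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)
# {'<OOV>': 1, 'love': 2, 'my': 3, 'i': 4, 'dog': 5, 'cat': 6, 'you': 7}

# unseen words ('is', 'amazing') all map to the <OOV> index 1
seq = tokenizer.texts_to_sequences(['my dog is amazing'])
print(seq)  # [[3, 5, 1, 1]]
```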
But the sequences still have different lengths, which makes training a neural network difficult, so we need to pad them to the same length.
```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

padded_sequences = pad_sequences(sequences,
                                 padding='post',     # pad on the right
                                 maxlen=5,           # maximum sequence length
                                 truncating='post')  # truncate on the right
padded_sequences
# then we get the result:
# array([[5, 3, 2, 4, 0],
#        [5, 3, 2, 7, 0],
#        [6, 3, 2, 4, 0],
#        [8, 6, 9, 2, 4]])
```
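The sample above overrides the defaults with `padding='post'` and `truncating='post'`; by default `pad_sequences` pads and truncates on the left (`'pre'`). A quick sketch of the default behaviour (the sequences here are made-up examples):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

seqs = [[3, 1, 2, 4], [6, 2, 4]]

# by default, sequences are padded on the left to the longest length
pre_padded = pad_sequences(seqs)
print(pre_padded)  # [[3 1 2 4] [0 6 2 4]]

# with maxlen set, sequences are also truncated on the left,
# so the end of each sequence is kept
truncated = pad_sequences(seqs, maxlen=3)
print(truncated)  # [[1 2 4] [6 2 4]]
```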