
word processing in nlp with tensorflow

This article introduces word preprocessing for training a neural network, using the Tokenizer class and serialization, with a few TensorFlow examples.

Preprocessing

Tokenizer

Source code: https://github.com/keras-team/keras-preprocessing/blob/master/keras_preprocessing/text.py#L490-L519

some important functions and variables

  • init # the constructor; key arguments include num_words and oov_token

  • def fit_on_texts(self, texts) # texts can be a string, a list of strings, or a list of lists of strings

  • self.word_index # a dictionary mapping each word to a unique integer index

  • self.index_word # the reverse of word_index: maps each index back to its word
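To illustrate the last two attributes, here is a minimal sketch (the sentences are invented for illustration):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(['the quick brown fox', 'the lazy dog'])

# word_index maps each word to a unique integer index,
# ordered by frequency ('the' appears twice, so it gets index 1)
print(tokenizer.word_index)

# index_word is simply the reverse mapping, index -> word
print(tokenizer.index_word)
```

Note that index 0 is never assigned to a word: it is reserved for padding, which is used later in this article.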

sample

  import tensorflow as tf
  from tensorflow import keras
  # the Tokenizer class transforms words into integer indices
  from tensorflow.keras.preprocessing.text import Tokenizer

  sentences = ['i love my dog', 'i love my cat', 'you love my dog!']
  tokenizer = Tokenizer(num_words = 100)
  tokenizer.fit_on_texts(sentences)
  word_index = tokenizer.word_index
  print(word_index)
  # result: {'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}
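One detail worth noting: num_words does not shrink word_index itself; it only caps which words texts_to_sequences will encode. A small sketch (the tiny num_words value is chosen just to make the effect visible):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['i love my dog', 'i love my cat', 'you love my dog!']
# only the (num_words - 1) most frequent words are kept when encoding
tokenizer = Tokenizer(num_words = 3)
tokenizer.fit_on_texts(sentences)

# word_index still contains every word seen during fitting...
print(tokenizer.word_index)

# ...but sequences only use indices below num_words,
# so 'i' (index 3) and 'dog' (index 4) are dropped here
print(tokenizer.texts_to_sequences(['i love my dog']))  # [[1, 2]]
```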

Serialization

sample

sentences = ['i love my dog', 'i love my cat', 'you love my dog!', 'do you think my dog is amazing']
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
 '''
   result is [[3, 1, 2, 4], [3, 1, 2, 5], [6, 1, 2, 4], [6, 2, 4]]
   'do', 'think', 'is' and 'amazing' are not encoded at all,
   because they did not appear in the texts the tokenizer was fitted on
 '''

To solve this problem, we can set an OOV (out-of-vocabulary) token in the tokenizer, which encodes any word that did not appear during fitting.

tokenizer = Tokenizer(num_words = 100, oov_token = "<OOV>")
# re-fit on the original three sentences, then encode all four again
tokenizer.fit_on_texts(['i love my dog', 'i love my cat', 'you love my dog!'])
sequences = tokenizer.texts_to_sequences(sentences)
print(sequences)
'''
    result: [[4, 2, 3, 5], [4, 2, 3, 6], [7, 2, 3, 5], [1, 7, 1, 3, 5, 1, 1]]
    every word unseen during fitting is now encoded as 1, the index of <OOV>
'''

But each sequence has a different length, which makes training a neural network difficult, so we need to make all sequences the same length.

from tensorflow.keras.preprocessing.sequence import pad_sequences

# this time the tokenizer is fitted on all four sentences
tokenizer = Tokenizer(num_words = 100, oov_token = "<OOV>")
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
padded_sequences = pad_sequences(sequences,
                                 padding = 'post',    # pad at the end
                                 maxlen = 5,          # maximum sequence length
                                 truncating = 'post') # truncate at the end
padded_sequences
'''
then we get the result (the indices differ from the previous output
because the tokenizer was fitted on all four sentences here):
array([[5, 3, 2, 4, 0],
       [5, 3, 2, 7, 0],
       [6, 3, 2, 4, 0],
       [8, 6, 9, 2, 4]])
'''
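For comparison, pad_sequences defaults to padding='pre' and truncating='pre', which adds zeros and removes excess tokens at the front instead. A quick sketch (the example sequences reuse the indices above):

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[8, 6, 9, 2, 4, 10, 11], [5, 3, 2, 4]]

# default: pad and truncate at the *front* of each sequence,
# so the long sequence keeps its last 5 tokens
print(pad_sequences(sequences, maxlen=5))
# [[ 9  2  4 10 11]
#  [ 0  5  3  2  4]]

# 'post': pad and truncate at the end instead
print(pad_sequences(sequences, maxlen=5, padding='post', truncating='post'))
# [[ 8  6  9  2  4]
#  [ 5  3  2  4  0]]
```

Pre-padding is the default because recurrent networks tend to weight the end of a sequence more heavily, so it is often better to keep the real tokens there and put the zeros in front.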