【506】Loading and preprocessing the IMDB dataset with Keras
阿新 • Published: 2020-12-27
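Use the IMDB data for sentiment analysis.
The dataset is downloaded through keras.datasets. Pay attention to the structure of the downloaded data, then preprocess it.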
from keras.datasets import imdb
from keras import preprocessing

# Number of words to consider as features
max_features = 10000
# Cut texts after this number of words
# (among top max_features most common words)
maxlen = 20

# Load the data as lists of integers.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
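x_train
- type: numpy.ndarray
- shape: (25000,); the reviews differ in length, so each one must be zero-padded or truncated to a common length
- every element is a list of integers, and each integer stands for a word
y_train: binary labels, 0 and 1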
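To make the integer-to-word correspondence concrete, here is a minimal sketch of decoding one review. It assumes the default encoding of imdb.load_data(), where ids 0, 1 and 2 are reserved for padding, start of sequence and out-of-vocabulary words, so real word ids are shifted by 3. The helper decode_review and the toy vocabulary are hypothetical names used only for illustration; with the real data the vocabulary would come from imdb.get_word_index().

```python
# Hypothetical helper: decode_review is not part of Keras.
def decode_review(encoded, word_index, index_from=3):
    """Map a list of integer ids back to words.

    Assumes the default imdb.load_data() encoding: ids 0, 1 and 2 are
    reserved (padding, start of sequence, out of vocabulary), so every
    real word id is shifted up by index_from.
    """
    # Invert the word -> rank mapping into rank -> word.
    reverse_index = {rank: word for word, rank in word_index.items()}
    # Reserved ids have no entry after the shift and fall back to "?".
    return " ".join(reverse_index.get(i - index_from, "?") for i in encoded)


# Toy vocabulary standing in for imdb.get_word_index(), which returns
# the real word -> rank dictionary (downloaded on first use).
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}

# 1 is the start marker, 2 marks an out-of-vocabulary word.
sample = [1, 4, 2, 5, 6, 7]
decoded = decode_review(sample, toy_word_index)
print(decoded)
```

With the real dataset, the same call would look like decode_review(x_train[0], imdb.get_word_index()), applied before any padding.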
The review lengths therefore need to be adjusted:
# This turns our lists of integers
# into a 2D integer tensor of shape `(samples, maxlen)`
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
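The length is set to maxlen=20. Because truncating defaults to 'pre', longer reviews keep only their last 20 word indices.
The resulting matrix can be fed directly to an Embedding layer as input.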
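As a rough illustration of why a (samples, maxlen) integer matrix suits an Embedding layer, the lookup such a layer performs can be sketched in plain NumPy. This is not Keras code: the random batch, the table and embedding_dim = 8 are assumptions made only for this sketch.

```python
import numpy as np

max_features = 10000   # vocabulary size, as in the loading code
maxlen = 20            # padded sequence length
embedding_dim = 8      # assumed vector size, chosen only for this sketch

rng = np.random.default_rng(0)

# Stand-in for padded data: 4 samples of maxlen word ids each.
x_batch = rng.integers(0, max_features, size=(4, maxlen))

# An embedding layer holds one trainable vector per word id.
embedding_table = rng.normal(size=(max_features, embedding_dim))

# Looking up every id is plain integer indexing into the table.
embedded = embedding_table[x_batch]

print(x_batch.shape)   # (4, 20)
print(embedded.shape)  # (4, 20, 8)
```

Every row must have the same length for this indexing to yield a regular 3D array, which is why the padding step comes first.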
Syntax:
keras.preprocessing.sequence.pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.)
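Converts a list of nb_samples sequences (sequences of scalars) into a 2D numpy array of shape (nb_samples, nb_timesteps). If the maxlen argument is given, nb_timesteps=maxlen; otherwise nb_timesteps is the length of the longest sequence. Sequences shorter than nb_timesteps are padded with value, and sequences longer than nb_timesteps are truncated to match the target length. Where padding and truncation take place is determined by padding and truncating respectively.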
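Parameters:
- sequences: a list of lists (two levels of nesting) of floats or integers
- maxlen: None or an integer, the maximum sequence length. Longer sequences are truncated; shorter ones are padded with zeros, at the front by default (padding='pre')
- dtype: data type of the returned numpy array
- padding: 'pre' or 'post', whether to pad at the start or at the end of each sequence
- truncating: 'pre' or 'post', whether to drop items from the start or from the end of a sequence that is too long
- value: float, the value used for padding in place of the default 0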
Returns:
A 2D numpy array (tensor) of shape (nb_samples, nb_timesteps)
Example:
>>> import numpy as np
>>> # dtype=object is required for ragged lists in recent NumPy versions
>>> a = np.array([[2, 3], [3, 4, 6], [7, 8, 9, 10]], dtype=object)
>>> a
array([list([2, 3]), list([3, 4, 6]), list([7, 8, 9, 10])], dtype=object)
>>> import keras
Using TensorFlow backend.
>>> b = keras.preprocessing.sequence.pad_sequences(a, maxlen=10)
>>> b
array([[ 0,  0,  0,  0,  0,  0,  0,  0,  2,  3],
       [ 0,  0,  0,  0,  0,  0,  0,  3,  4,  6],
       [ 0,  0,  0,  0,  0,  0,  7,  8,  9, 10]])
>>> c = keras.preprocessing.sequence.pad_sequences(a, maxlen=10, padding='post')
>>> c
array([[ 2,  3,  0,  0,  0,  0,  0,  0,  0,  0],
       [ 3,  4,  6,  0,  0,  0,  0,  0,  0,  0],
       [ 7,  8,  9, 10,  0,  0,  0,  0,  0,  0]])
>>> d = keras.preprocessing.sequence.pad_sequences(a, maxlen=3, padding='post')
>>> d
array([[ 2,  3,  0],
       [ 3,  4,  6],
       [ 8,  9, 10]])
>>> e = keras.preprocessing.sequence.pad_sequences(a, maxlen=3)
>>> e
array([[ 0,  2,  3],
       [ 3,  4,  6],
       [ 8,  9, 10]])
>>> f = keras.preprocessing.sequence.pad_sequences(a, maxlen=3, padding='post', truncating='post')
>>> f
array([[2, 3, 0],
       [3, 4, 6],
       [7, 8, 9]])
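The padding and truncating rules can also be written out by hand. The following is a simplified NumPy sketch, not the Keras implementation: pad_sketch is a hypothetical name, it assumes a positive maxlen and integer data, and it skips the input validation the real function performs.

```python
import numpy as np


def pad_sketch(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Simplified re-implementation of the padding rules, for illustration."""
    if maxlen is None:
        # Without maxlen, the longest sequence sets the width.
        maxlen = max(len(seq) for seq in sequences)
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # nothing to copy, the row stays all padding
        # Truncation: 'pre' keeps the last maxlen items, 'post' the first.
        kept = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        # Padding: 'pre' right-aligns the kept items, 'post' left-aligns.
        if padding == "pre":
            out[row, -len(kept):] = kept
        else:
            out[row, :len(kept)] = kept
    return out


a = [[2, 3], [3, 4, 6], [7, 8, 9, 10]]
e = pad_sketch(a, maxlen=3)
f = pad_sketch(a, maxlen=3, padding="post", truncating="post")
print(e)
print(f)
```

Here e drops the leading 7 of the last sequence and pads the first one at the front, while f drops the trailing 10 and pads at the end.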