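The keras\preprocessing directory, file by file, 5.2 (sequence.py) - Keras study notes, part 5
阿新 • Published: 2019-01-06
Purpose: utilities for preprocessing sequence data (for example an article or a sentence).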
keras-master\keras\preprocessing\sequence.py
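These utilities prepare input text for a word-embedding layer by converting it into a data format that can be processed further (for example, a matrix).
Code annotations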
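To make "converting text into a matrix" concrete, here is a minimal pure-Python sketch of the idea: variable-length lists of word indices are truncated or padded to one common length so that they stack into rows of equal width. The helper name `pad_to_matrix` is hypothetical and is not part of Keras; it only mirrors the 'pre'/'post' padding and truncating rules described in the `pad_sequences` docstring, using plain lists instead of a NumPy array.

```python
def pad_to_matrix(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Hypothetical helper (not Keras API): pad/truncate lists to equal length."""
    if padding not in ("pre", "post"):
        raise ValueError('Padding type "%s" not understood' % padding)
    if truncating not in ("pre", "post"):
        raise ValueError('Truncating type "%s" not understood' % truncating)
    if maxlen is None:
        # Default: length of the longest sequence.
        maxlen = max((len(s) for s in sequences), default=0)
    rows = []
    for s in sequences:
        s = list(s)
        if len(s) > maxlen:
            # 'pre' drops items from the front, 'post' drops them from the end.
            s = s[len(s) - maxlen:] if truncating == "pre" else s[:maxlen]
        fill = [value] * (maxlen - len(s))
        # 'pre' puts the fill values before the data, 'post' after it.
        rows.append(fill + s if padding == "pre" else s + fill)
    return rows


word_ids = [[3, 7], [5, 1, 9, 2], [8]]
for row in pad_to_matrix(word_ids, maxlen=3):
    print(row)
# [0, 3, 7]
# [1, 9, 2]
# [0, 0, 8]
```

With the defaults, both padding and truncation act on the front of each sequence, so the most recent items of a long sequence are the ones that survive.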
# -*- coding: utf-8 -*-
"""Utilities for preprocessing sequence data.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import random
from six.moves import range


def pad_sequences(sequences, maxlen=None, dtype='int32',
                  padding='pre', truncating='pre', value=0.):
    """Pads each sequence to the same length (length of the longest sequence).

    If maxlen is provided, any sequence longer
    than maxlen is truncated to maxlen.
    Truncation happens off either the beginning (default) or
    the end of the sequence.

    Supports post-padding and pre-padding (default).

    # Arguments
        sequences: list of lists where each element is a sequence
        maxlen: int, maximum length
        dtype: type to cast the resulting sequence.
        padding: 'pre' or 'post', pad either before or after each sequence.
        truncating: 'pre' or 'post', remove values from sequences larger than
            maxlen either in the beginning or in the end of the sequence
        value: float, value to pad the sequences to the desired value.

    # Returns
        x: numpy array with dimensions (number_of_sequences, maxlen)
           (number_of_sequences is the number of sequences,
           maxlen is the maximum sequence length)

    # Raises
        ValueError: in case of invalid values for `truncating` or `padding`,
            or in case of invalid shape for a `sequences` entry.
    """
    if not hasattr(sequences, '__len__'):
        raise ValueError('`sequences` must be iterable.')
    lengths = []
    for x in sequences:
        if not hasattr(x, '__len__'):
            raise ValueError('`sequences` must be a list of iterables. '
                             'Found non-iterable: ' + str(x))
        lengths.append(len(x))

    num_samples = len(sequences)
    if maxlen is None:
        maxlen = np.max(lengths)

    # take the sample shape from the first non empty sequence
    # checking for consistency in the main loop below.
    sample_shape = tuple()
    for s in sequences:
        if len(s) > 0:
            sample_shape = np.asarray(s).shape[1:]
            break

    x = (np.ones((num_samples, maxlen) + sample_shape) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if not len(s):
            continue  # empty list/array was found
        if truncating == 'pre':
            trunc = s[-maxlen:]
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError('Truncating type "%s" not understood' % truncating)

        # check `trunc` has expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError('Shape of sample %s of sequence at position %s '
                             'is different from expected shape %s' %
                             (trunc.shape[1:], idx, sample_shape))

        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError('Padding type "%s" not understood' % padding)
    return x


def make_sampling_table(size, sampling_factor=1e-5):
    """Generates a word rank-based probabilistic sampling table.

    This generates an array where the ith element
    is the probability that a word of rank i would be sampled,
    according to the sampling distribution used in word2vec.

    The word2vec formula is:
        p(word) = min(1, sqrt(word.frequency/sampling_factor) / (word.frequency/sampling_factor))

    We assume that the word frequencies follow Zipf's law (s=1) to derive
    a numerical approximation of frequency(rank):
       frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))
        where gamma is the Euler-Mascheroni constant.

    Zipf's law: https://en.wikipedia.org/wiki/Zipf%27s_law
                https://www.cnblogs.com/sddai/p/6081447.html

    # Arguments
        size: int, number of possible words to sample.
        sampling_factor: the sampling factor in the word2vec formula.

    # Returns
        A 1D Numpy array of length `size` where the ith entry
        is the probability that a word of rank i should be sampled.
    """
    gamma = 0.577
    rank = np.arange(size)
    rank[0] = 1
    inv_fq = rank * (np.log(rank) + gamma) + 0.5 - 1. / (12. * rank)
    f = sampling_factor * inv_fq

    return np.minimum(1., f / np.sqrt(f))


def skipgrams(sequence, vocabulary_size,
              window_size=4, negative_samples=1., shuffle=True,
              categorical=False, sampling_table=None, seed=None):
    """Generates skipgram word pairs.

    skipgram: https://blog.csdn.net/u010665216/article/details/78721354?locationNum=7&fps=1

    Takes a sequence (list of indexes of words),
    returns couples of [word_index, other_word index] and labels (1s or 0s),
    where label = 1 if 'other_word' belongs to the context of 'word',
    and label=0 if 'other_word' is randomly sampled

    # Arguments
        sequence: a word sequence (sentence), encoded as a list
            of word indices (integers). If using a `sampling_table`,
            word indices are expected to match the rank
            of the words in a reference dataset (e.g. 10 would encode
            the 10-th most frequently occurring token).
            Note that index 0 is expected to be a non-word and will be skipped.
        vocabulary_size: int. maximum possible word index + 1
            (the first word index is 0)
        window_size: int. actually half-window.
            The window of a word wi will be [i-window_size, i+window_size+1]
        negative_samples: float >= 0. 0 for no negative (=random) samples.
            1 for same number as positive samples. etc.
        shuffle: whether to shuffle the word couples before returning them.
        categorical: bool. if False, labels will be
            integers (eg. [0, 1, 1 .. ]),
            if True labels will be categorical eg. [[1,0],[0,1],[0,1] .. ]
        sampling_table: 1D array of size `vocabulary_size` where the entry i
            encodes the probability to sample a word of rank i.
        seed: random seed.

    # Returns
        couples, labels: where `couples` are int pairs and
            `labels` are either 0 or 1.

    # Note
        By convention, index 0 in the vocabulary is
        a non-word and will be skipped.
    """
    couples = []
    labels = []
    for i, wi in enumerate(sequence):
        if not wi:
            continue
        if sampling_table is not None:
            if sampling_table[wi] < random.random():
                continue

        window_start = max(0, i - window_size)
        window_end = min(len(sequence), i + window_size + 1)
        for j in range(window_start, window_end):
            if j != i:
                wj = sequence[j]
                if not wj:
                    continue
                couples.append([wi, wj])
                if categorical:
                    labels.append([0, 1])
                else:
                    labels.append(1)

    if negative_samples > 0:
        num_negative_samples = int(len(labels) * negative_samples)
        words = [c[0] for c in couples]
        random.shuffle(words)

        couples += [[words[i % len(words)],
                     random.randint(1, vocabulary_size - 1)]
                    for i in range(num_negative_samples)]
        if categorical:
            labels += [[1, 0]] * num_negative_samples
        else:
            labels += [0] * num_negative_samples

    if shuffle:
        if seed is None:
            # int(...) added here: the float bound 10e6 is rejected
            # by random.randint on recent Python versions.
            seed = random.randint(0, int(10e6))
        # Re-seeding before each shuffle keeps couples and labels aligned.
        random.seed(seed)
        random.shuffle(couples)
        random.seed(seed)
        random.shuffle(labels)

    return couples, labels


def _remove_long_seq(maxlen, seq, label):
    """Removes sequences that exceed the maximum length.

    # Arguments
        maxlen: int, maximum length
        seq: list of lists where each sublist is a sequence
        label: list where each element is an integer

    # Returns
        new_seq, new_label: shortened lists for `seq` and `label`.
    """
    new_seq, new_label = [], []
    for x, y in zip(seq, label):
        # Strict comparison: a sequence of exactly maxlen items is dropped too.
        if len(x) < maxlen:
            new_seq.append(x)
            new_label.append(y)
    return new_seq, new_label
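The sampling table is the least obvious part of the file, so a worked calculation may help. The sketch below redoes the arithmetic of `make_sampling_table` with only the standard library; the function name `sampling_probabilities` is hypothetical, not a Keras name. Note that `f / sqrt(f)` is simply `sqrt(f)`, so each entry is `min(1, sqrt(sampling_factor / frequency(rank)))` under the Zipf approximation quoted in the docstring.

```python
import math


def sampling_probabilities(size, sampling_factor=1e-5):
    """Hypothetical stdlib re-computation of the word2vec sampling table."""
    gamma = 0.577  # same rounded Euler-Mascheroni constant as the source
    probs = []
    for rank in range(size):
        r = max(rank, 1)  # the source overwrites rank 0 with 1 to avoid log(0)
        # Approximate inverse frequency of a word of this rank (Zipf, s=1).
        inv_fq = r * (math.log(r) + gamma) + 0.5 - 1.0 / (12.0 * r)
        f = sampling_factor * inv_fq
        probs.append(min(1.0, f / math.sqrt(f)))
    return probs


table = sampling_probabilities(10)
# Frequent words (low rank) get a low keep-probability; rarer words get a higher one.
print(table[1] < table[9])  # True
```

With the default `sampling_factor`, the first entries are very small (on the order of 0.003 for rank 1), which is how very frequent words end up being subsampled most aggressively.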
Running the code
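As a small runnable illustration that does not need Keras installed, the sketch below reproduces only the positive-pair part of `skipgrams`: for every non-zero word, it pairs the word with each non-zero neighbour inside the half-window. The name `positive_pairs` is hypothetical; negative sampling, the sampling table and shuffling are deliberately left out so that the output is deterministic.

```python
def positive_pairs(sequence, window_size=4):
    """Hypothetical helper: the label-1 couples that skipgrams would build."""
    pairs = []
    for i, wi in enumerate(sequence):
        if not wi:
            continue  # index 0 is a non-word and is skipped
        # Half-open window [i - window_size, i + window_size + 1), clipped to the sequence.
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i and sequence[j]:
                pairs.append([wi, sequence[j]])
    return pairs


sentence = [1, 2, 0, 3]
print(positive_pairs(sentence, window_size=1))
# [[1, 2], [2, 1]]
print(positive_pairs(sentence, window_size=2))
# [[1, 2], [2, 1], [2, 3], [3, 2]]
```

With `window_size=1` the word 3 produces no pair because its only neighbour is the padding index 0; widening the window to 2 lets it reach the word 2.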