keras\preprocessing directory files explained 5.2 (sequence.py) - Keras study notes 5

阿新 • Published: 2019-01-06

Purpose: utilities for preprocessing sequence data (for example, an article or a sentence).

keras-master\keras\preprocessing\sequence.py

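These utilities prepare input text that has been encoded as integer sequences for a word-embedding layer, converting it into a data format that can be processed further (for example, a matrix).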

Annotated code

# -*- coding: utf-8 -*-
"""Utilities for preprocessing sequence data.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import random
from six.moves import range


def pad_sequences(sequences, maxlen=None, dtype='int32',
                  padding='pre', truncating='pre', value=0.):
    """Pads each sequence to the same length (length of the longest sequence).

    If maxlen is provided, any sequence longer
    than maxlen is truncated to maxlen.
    Truncation happens off either the beginning (default) or
    the end of the sequence.

    Supports post-padding and pre-padding (default).

    # Arguments
        sequences: list of lists where each element is a sequence
            (every element of the outer list is itself a list).
        maxlen: int, maximum length. If None, the length of the
            longest sequence is used.
        dtype: type to cast the resulting sequence.
        padding: 'pre' or 'post', pad either before or after each sequence.
        truncating: 'pre' or 'post', remove values from sequences larger than
            maxlen either in the beginning or in the end of the sequence
        value: float, value used to pad the sequences to the desired length.

    # Returns
        x: numpy array with dimensions (number_of_sequences, maxlen),
            where number_of_sequences is the number of sequences and
            maxlen is the maximum sequence length.

    # Raises
        ValueError: in case of invalid values for `truncating` or `padding`,
            or in case of invalid shape for a `sequences` entry.
    """
    if not hasattr(sequences, '__len__'):
        raise ValueError('`sequences` must be iterable.')
    lengths = []
    for x in sequences:
        if not hasattr(x, '__len__'):
            raise ValueError('`sequences` must be a list of iterables. '
                             'Found non-iterable: ' + str(x))
        lengths.append(len(x))

    num_samples = len(sequences)
    if maxlen is None:
        maxlen = np.max(lengths)

    # take the sample shape from the first non empty sequence
    # checking for consistency in the main loop below.
    sample_shape = tuple()
    for s in sequences:
        if len(s) > 0:
            sample_shape = np.asarray(s).shape[1:]
            break

    x = (np.ones((num_samples, maxlen) + sample_shape) * value).astype(dtype)
    for idx, s in enumerate(sequences):
        if not len(s):
            continue  # empty list/array was found
        if truncating == 'pre':
            trunc = s[-maxlen:]
        elif truncating == 'post':
            trunc = s[:maxlen]
        else:
            raise ValueError('Truncating type "%s" not understood' % truncating)

        # check `trunc` has expected shape
        trunc = np.asarray(trunc, dtype=dtype)
        if trunc.shape[1:] != sample_shape:
            raise ValueError('Shape of sample %s of sequence at position %s is different from expected shape %s' %
                             (trunc.shape[1:], idx, sample_shape))

        if padding == 'post':
            x[idx, :len(trunc)] = trunc
        elif padding == 'pre':
            x[idx, -len(trunc):] = trunc
        else:
            raise ValueError('Padding type "%s" not understood' % padding)
    return x


def make_sampling_table(size, sampling_factor=1e-5):
    """Generates a word rank-based probabilistic sampling table.

    This generates an array where the ith element
    is the probability that a word of rank i would be sampled,
    according to the sampling distribution used in word2vec.

    The word2vec formula is:
        p(word) = min(1, sqrt(word.frequency/sampling_factor) / (word.frequency/sampling_factor))

    We assume that the word frequencies follow Zipf's law (s=1) to derive
    a numerical approximation of frequency(rank):
        frequency(rank) ~ 1/(rank * (log(rank) + gamma) + 1/2 - 1/(12*rank))
        where gamma is the Euler-Mascheroni constant.

    Zipf's law: https://en.wikipedia.org/wiki/Zipf%27s_law
    https://www.cnblogs.com/sddai/p/6081447.html

    # Arguments
        size: int, number of possible words to sample.
        sampling_factor: the sampling factor in the word2vec formula.

    # Returns
        A 1D Numpy array of length `size` where the ith entry
        is the probability that a word of rank i should be sampled.
    """
    gamma = 0.577
    rank = np.arange(size)
    rank[0] = 1
    inv_fq = rank * (np.log(rank) + gamma) + 0.5 - 1. / (12. * rank)
    f = sampling_factor * inv_fq

    return np.minimum(1., f / np.sqrt(f))


def skipgrams(sequence, vocabulary_size,
              window_size=4, negative_samples=1., shuffle=True,
              categorical=False, sampling_table=None, seed=None):
    """Generates skipgram word pairs.

    skipgram: https://blog.csdn.net/u010665216/article/details/78721354?locationNum=7&fps=1

    Takes a sequence (list of indexes of words),
    returns couples of [word_index, other_word index] and labels (1s or 0s),
    where label = 1 if 'other_word' belongs to the context of 'word',
    and label=0 if 'other_word' is randomly sampled

    # Arguments
        sequence: a word sequence (sentence), encoded as a list
            of word indices (integers). If using a `sampling_table`,
            word indices are expected to match the rank
            of the words in a reference dataset (e.g. 10 would encode
            the 10-th most frequently occurring token).
            Note that index 0 is expected to be a non-word and will be skipped.
        vocabulary_size: int. maximum possible word index + 1
            (word indices start at 0).
        window_size: int. actually half-window.
            The window of a word wi will be [i-window_size, i+window_size+1]
            (the end index is exclusive, as in `range`).
        negative_samples: float >= 0. 0 for no negative (=random) samples.
            1 for same number as positive samples. etc.
        shuffle: whether to shuffle the word couples before returning them.
        categorical: bool. if False, labels will be
            integers (eg. [0, 1, 1 .. ]),
            if True labels will be categorical eg. [[1,0],[0,1],[0,1] .. ]
        sampling_table: 1D array of size `vocabulary_size` where the entry i
            encodes the probability to sample a word of rank i.
        seed: random seed.

    # Returns
        couples, labels: where `couples` are int pairs and
            `labels` are either 0 or 1.

    # Note
        By convention, index 0 in the vocabulary is
        a non-word and will be skipped.
    """
    couples = []
    labels = []
    for i, wi in enumerate(sequence):
        if not wi:
            continue
        if sampling_table is not None:
            if sampling_table[wi] < random.random():
                continue

        window_start = max(0, i - window_size)
        window_end = min(len(sequence), i + window_size + 1)
        for j in range(window_start, window_end):
            if j != i:
                wj = sequence[j]
                if not wj:
                    continue
                couples.append([wi, wj])
                if categorical:
                    labels.append([0, 1])
                else:
                    labels.append(1)

    if negative_samples > 0:
        num_negative_samples = int(len(labels) * negative_samples)
        words = [c[0] for c in couples]
        random.shuffle(words)

        couples += [[words[i % len(words)],
                    random.randint(1, vocabulary_size - 1)] for i in range(num_negative_samples)]
        if categorical:
            labels += [[1, 0]] * num_negative_samples
        else:
            labels += [0] * num_negative_samples

    if shuffle:
        if seed is None:
            # cast to int: random.randint expects integer bounds
            seed = random.randint(0, int(10e6))
        random.seed(seed)
        random.shuffle(couples)
        random.seed(seed)
        random.shuffle(labels)

    return couples, labels


def _remove_long_seq(maxlen, seq, label):
    """Removes sequences that exceed the maximum length.

    Note: the comparison in the loop below is strict (`len(x) < maxlen`),
    so sequences whose length equals `maxlen` are removed as well.

    # Arguments
        maxlen: int, maximum length
        seq: list of lists where each sublist is a sequence
        label: list where each element is an integer

    # Returns
        new_seq, new_label: shortened lists for `seq` and `label`.
    """
    new_seq, new_label = [], []
    for x, y in zip(seq, label):
        if len(x) < maxlen:
            new_seq.append(x)
            new_label.append(y)
    return new_seq, new_label
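
The values produced by `make_sampling_table` in the listing above can be checked by hand. Since f / sqrt(f) equals sqrt(f), the entry for rank r is min(1, sqrt(sampling_factor / frequency(r))), which is the word2vec formula from the docstring with the Zipf approximation substituted for the frequency. What follows is a minimal pure-Python sketch of that calculation, assuming only the standard library; `sampling_table_sketch` and `GAMMA` are illustrative names and not part of Keras.

```python
import math

GAMMA = 0.577  # Euler-Mascheroni constant, truncated as in the listing


def sampling_table_sketch(size, sampling_factor=1e-5):
    """Illustrative re-computation of the table, one rank at a time."""
    table = []
    for rank in range(size):
        r = max(rank, 1)  # the listing overwrites rank 0 with 1
        # inverse of the approximated Zipf frequency for this rank
        inv_fq = r * (math.log(r) + GAMMA) + 0.5 - 1.0 / (12.0 * r)
        f = sampling_factor * inv_fq
        # f / sqrt(f) is the same quantity as sqrt(f)
        table.append(min(1.0, f / math.sqrt(f)))
    return table


table = sampling_table_sketch(10000)
for rank in (1, 10, 100, 1000, 9999):
    print(rank, round(table[rank], 4))
```

Rank 0 receives the same value as rank 1 because the listing sets `rank[0] = 1`, which avoids evaluating log(0). Frequent words (low rank) get a small sampling probability and rarer words a larger one, so the table never decreases as the rank grows.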

Running the code
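
To see what `pad_sequences` does without installing Keras, the padding and truncation rule can be mimicked for flat integer lists. This is a simplified sketch under that assumption: `pad_one` is a hypothetical helper, it handles one 1-D list at a time, returns a plain list instead of a NumPy array, and assumes `maxlen >= 1`.

```python
def pad_one(seq, maxlen, padding='pre', truncating='pre', value=0):
    """Pad or truncate one flat list to exactly `maxlen` items."""
    if truncating == 'pre':
        trunc = seq[-maxlen:]   # drop items from the beginning
    elif truncating == 'post':
        trunc = seq[:maxlen]    # drop items from the end
    else:
        raise ValueError('Truncating type "%s" not understood' % truncating)

    fill = [value] * (maxlen - len(trunc))
    if padding == 'pre':
        return fill + trunc     # padding goes in front
    elif padding == 'post':
        return trunc + fill     # padding goes at the back
    raise ValueError('Padding type "%s" not understood' % padding)


sequences = [[1, 2, 3, 4, 5], [6, 7], []]
maxlen = 3

pre = [pad_one(s, maxlen) for s in sequences]
post = [pad_one(s, maxlen, padding='post', truncating='post')
        for s in sequences]

print(pre)   # [[3, 4, 5], [0, 6, 7], [0, 0, 0]]
print(post)  # [[1, 2, 3], [6, 7, 0], [0, 0, 0]]
```

With the defaults ('pre' for both arguments) the end of each sequence is kept and the zeros are placed in front. An empty sequence becomes a row made up entirely of the padding value.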

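The pair generation in `skipgrams` can be sketched in the same spirit. The example below is a simplified illustration, not the Keras implementation: it leaves out the `sampling_table` subsampling and the final shuffle, draws negatives from its own `random.Random` instance instead of the global `random` module, and pairs each positive target word with one random index without shuffling the targets first. `positive_pairs` and `negative_pairs` are hypothetical helper names.

```python
import random


def positive_pairs(sequence, window_size=4):
    """Collect [word, context_word] pairs inside the half-window."""
    couples = []
    for i, wi in enumerate(sequence):
        if not wi:  # index 0 is a non-word and is skipped
            continue
        window_start = max(0, i - window_size)
        window_end = min(len(sequence), i + window_size + 1)
        for j in range(window_start, window_end):
            if j != i and sequence[j]:
                couples.append([wi, sequence[j]])
    return couples


def negative_pairs(couples, vocabulary_size, rng):
    """Pair each target word with a random index in [1, vocabulary_size - 1]."""
    return [[wi, rng.randint(1, vocabulary_size - 1)] for wi, _ in couples]


sentence = [5, 0, 9, 3]  # 0 marks a non-word
pos = positive_pairs(sentence, window_size=1)
neg = negative_pairs(pos, vocabulary_size=10, rng=random.Random(0))

couples = pos + neg
labels = [1] * len(pos) + [0] * len(neg)

print(pos)     # [[9, 3], [3, 9]]
print(labels)  # [1, 1, 0, 0]
```

Word 5 produces no pair because its only neighbour inside the window is the non-word 0. As in the listing, a negative sample is drawn uniformly from the vocabulary, so it can occasionally coincide with a genuine context word.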
